Abstract

We propose a formalism for representation of finite languages, referred to as the class 
of IDL-expressions, which combines concepts that were only considered in isolation in 
existing formalisms. The suggested applications are in natural language processing, more 
specifically in surface natural language generation and in machine translation, where a 
sentence is obtained by first generating a large set of candidate sentences, represented in 
a compact way, and then filtering such a set through a parser. We study several formal 
properties of IDL-expressions and compare this new formalism with more standard ones. 
We also present a novel parsing algorithm for IDL-expressions and prove a non-trivial upper 
bound on its time complexity. 

1. Introduction 

In natural language processing, more specifically in applications that involve natural lan- 
guage generation, the task of surface generation consists in the process of generating an out- 
put sentence in a target language, on the basis of some input representation of the desired 
meaning for the output sentence. During the last decade, a number of new approaches for 
natural language surface generation have been put forward, called hybrid approaches. Hy- 
brid approaches make use of symbolic knowledge in combination with statistical techniques 
that have recently been developed for natural language processing. Hybrid approaches 
therefore share many advantages with statistical methods for natural language processing, 
such as high accuracy, wide coverage, robustness, portability and scalability. 

Hybrid approaches are typically based on two processing phases, described in what 
follows (Knight & Hatzivassiloglou, 1995; Langkilde & Knight, 1998; Bangalore & Rambow,
2000 report examples of applications of this approach in real world generation systems).
In the first phase one generates a large set of candidate sentences by a relatively simple 
process. This is done on the basis of an input sentence in some source language in case 
the process is embedded within a machine translation system, or more generally on the 
basis of some logical/semantic representation, called conceptual structure, which denotes 
the meaning that the output sentence should convey. This first phase involves no or only 
few intricacies of the target language, and the set of candidate sentences may contain many
that are ungrammatical or that can otherwise be seen as less desirable than others. In the 
second phase one or more preferred sentences are selected from the collection of candidates, 
exploiting some form of syntactic processing that more heavily relies on properties of the 
target language than the first phase. This syntactic processing may involve language models 
as simple as bigrams or it may involve more powerful models such as those based on context- 
free grammars, which typically perform with higher accuracy on this task (see for instance 
work presented by Charniak, 2001 and references therein). 

In hybrid approaches, the generation of the candidate set typically involves a symbolic 
grammar that has been quickly hand-written, and is quite small and easy to maintain. 
Such a grammar cannot therefore account for all of the intricacies of the target language. 
For instance, frequency information for synonyms and collocation information in general 
is not encoded in the grammar. Similarly, lexico-syntactic and selectional constraints for 
the target language might not be fully specified, as is usually the case with small and mid- 
sized grammars. Furthermore, there might also be some underspecification stemming from 
the input conceptual structure. This is usually the case if the surface generation module 
is embedded into a larger architecture for machine translation, and the source language is 
underspecified for features such as definiteness, time and number. Since inferring the missing 
information from the sentence context is a very difficult task, the surface generation module 
usually has to deal with underspecified knowledge. 

All of the above-mentioned problems are well-known in the literature on natural language 
surface generation, and are usually referred to as "lack of knowledge" of the system or of the 
input. As a consequence of these problems, the set of candidate sentences generated in the 
first phase may be extremely large. In real world generation systems, candidate sets have 
been reported to contain as many as 10^^ sentences (Langkilde, 2000). As already explained, 
the second processing phase in hybrid approaches is intended to reduce these huge sets to 
subsets containing only a few sentences. This is done by exploiting knowledge about the 
target language that was not available in the first phase. This additional knowledge can 
often be obtained through automatic extraction from corpora, which requires considerably 
less effort than the development of hand-written, purely symbolic systems. 

Due to the extremely large size of the set of candidate sentences, the feasibility of hybrid 
approaches to surface natural language generation relies on 

• the compactness of the representation of a set of candidate sentences that in real world 
systems might be as large as 10^^; and 

• the efficiency of syntactic processing of the stored set. 

Several solutions have been adopted in existing hybrid systems for the representation 
of the set of candidate sentences. These include bags of words (Brown et al., 1990) and 
bags of complex lexical representations (Beaven, 1992; Brew, 1992; Whitelock, 1992), word 
lattices (Knight & Hatzivassiloglou, 1995; Langkilde & Knight, 1998; Bangalore & Rambow,
2000), and non-recursive context-free grammars (Langkilde, 2000). As will be discussed in
detail in Section 2, word lattices and non-recursive context-free grammars allow encoding 
of precedence constraints and choice among different words, but they both lack a primitive 
for representing strings that are realized by combining a collection of words in an arbitrary 
order. On the other hand, bags of words allow the encoding of free word order, but in
such a representation one cannot directly express precedence constraints and choice among 
different words. 

In this paper we propose a new representation that combines all of the above-mentioned 
primitives. This representation consists of IDL-expressions. In the term IDL-expression,
'I' stands for "interleave", which pertains to phrases that may occur interleaved, allowing
freedom of word order (a precise definition of this notion will be provided in
Section 4); 'D' stands for "disjunction", which allows choices of words or phrases; 'L' stands
for "lock", which is used to constrain the application of the interleave operator. We study
some interesting properties of this representation, and argue that the expressivity of the
formalism makes it more suitable than the alternatives discussed above for use within hybrid
architectures for surface natural language generation. We also associate IDL-expressions
with IDL-graphs, an equivalent representation that can be more easily interpreted by a
machine, and develop a dynamic programming algorithm for parsing IDL-graphs using a 
context-free grammar. If a set of candidate sentences is represented as an IDL-expression 
or IDL-graph, the algorithm can be used to filter out ungrammatical sentences from the 
set, or to rank the sentences in the set according to their likelihood, in case the context-free 
grammar assigns weights to derivations. While parsing is traditionally defined for input 
consisting of a single string, we here conceive parsing as a process that can be carried out 
on an input device denoting a language, i.e., a set of strings. 

There is a superficial similarity between the problem described above of representing 
finite sets in surface generation, and a different research topic, often referred to as discon- 
tinuous parsing. In discontinuous parsing one seeks to relax the definition of context-free 
grammars in order to represent the syntax of languages that exhibit constructions with
uncertainty on word or constituent order (see for instance work reported by Daniels & Meurers,
2002 and references therein). In fact, some of the operators we use in IDL-expressions have
also been exploited in recent work on discontinuous parsing. However, the parsing problem 
for discontinuous grammars and the parsing problem for IDL-expressions are quite differ- 
ent: in the former, we are given a grammar with productions that express uncertainty on 
constituent order, and need to parse an input string whose symbols are totally ordered; in 
the latter problem we are given a grammar with total order on the constituents appear- 
ing in each production, and need to parse an input that includes uncertainty on word and 
constituent order. 

This paper is structured as follows. In Section 2 we give a brief overview of existing 
representations of finite languages that have been used in surface generation components. 
We then discuss some notational preliminaries in Section 3. In Section 4 we introduce IDL- 
expressions and define their semantics. In Section 5 we associate with IDL-expressions an 
equivalent but more procedural representation, called IDL-graphs. We also introduce the 
important notion of cut of an IDL-graph, which will be exploited later by our algorithm. 
In Section 6 we briefly discuss the Earley algorithm, a traditional method for parsing a 
string using a context-free grammar, and adapt this algorithm to work on finite languages 
encoded by IDL-graphs. In Section 7 we prove a non-trivial upper bound on the number 
of cuts in an IDL-graph, and on this basis we investigate the computational complexity of 
our parsing algorithm. We also address some implementational issues. We conclude with 
some discussion in Section 8. 
2. Representations of Finite Languages

In this section we analyze and compare existing representations of finite languages that have 
been adopted in surface generation components of natural language systems. 

Bags (or multisets) of words have been used in several approaches to surface generation. 
They are at the basis of the generation component of the statistical machine translation 
models proposed by Brown et al. (1990). Bags of complex lexical signs have also been used 
in the machine translation approach described by Beaven (1992) and Whitelock (1992), 
called shake-and-bake. As already mentioned, bags are a very succinct representation of 
finite languages, since they allow encoding of more than exponentially many strings in the 
size of a bag itself. This power comes at a cost, however. Deciding whether some string 
encoded by an input bag can be parsed by a CFG is NP-complete (Brew, 1992). It is not 
difficult to show that this result still holds in the case of a regular grammar or, equivalently, 
a regular expression. An NP-completeness result involving bags has also been presented 
by Knight (1999), for a related problem where the parsing grammar is a probabilistic model 
based on bigrams. 

As far as expressivity is concerned, bags of words also have strict limitations. These
structures lack a primitive for expressing choices among words. As already observed in 
the introduction, this is a serious problem in natural language generation, where alterna- 
tives in lexical realization must be encoded in the presence of lack of detailed knowledge 
of the target language. In addition, bags of words usually do not come with precedence 
constraints. However, in natural language applications these constraints are very common, 
and are usually derived from knowledge about the target language or, in the case of ma- 
chine translation, from the parsing tree of the source string. In order to represent these 
constraints, extra machinery must be introduced. For instance, Brown et al. (1990) impose,
for each word in the bag, a probabilistic distribution delimiting its position in the target
string, on the basis of the original position of the source word in the input string to be
translated. In the shake-and-bake approach, bags are defined over functional structures,
each representing complex lexical information from which constraints can be derived. Then 
the parsing algorithm for bags is interleaved with a constraint propagation algorithm to 
filter out parses (e.g., as done by Brew, 1992). As a general remark, having different layers 
of representation requires the development of more involved parsing algorithms, which we 
try to avoid in the new proposal to be described below. 

An alternative representation of finite languages is the class of acyclic deterministic fi- 
nite automata, also called word lattices. This representation has often been used in hybrid 
approaches to surface generation (Knight & Hatzivassiloglou, 1995; Langkilde & Knight,
1998; Bangalore & Rambow, 2000), and more generally in natural language applications
where some form of uncertainty comes with the input, as for instance in speech recognition
(Jurafsky & Martin, 2000, Section 7.4). Word lattices inherit from standard regular
expressions the primitives expressing concatenation and disjunction, and thereby allow the 
encoding of precedence constraints and word disjunction in a direct way. Furthermore, word 
lattices can be efficiently parsed by means of CFGs, using standard techniques for lattice 
parsing (Aust, Oerder, Seide, & Steinbiss, 1995). Lattice parsing requires cubic time in
the number of states of the input finite automaton and linear time in the size of the CFG. 
Methods for lattice parsing can all be traced back to Bar-Hillel, Perles, and Shamir (1964), 
who prove that the class of context-free languages is closed under intersection with regular
languages. 

One limitation of word lattices and finite automata in general is the lack of an operator 
for free word order. As we have already discussed in the introduction, this is a severe 
limitation for hybrid systems, where free word order in sentence realization is needed in 
case the symbolic grammar used in the first phase fails to provide ordering constraints. 
To represent strings where a bag of words can occur in every possible order, one has to 
encode each string through an individual path within the lattice. In the general case, this 
requires an amount of space that is more than exponential in the size of the bag. From
this perspective, the previously mentioned polynomial time result for parsing is to no avail, 
since the input structure to the parser might already be of size more than exponential in the 
size of the input conceptual structure. The problem of free word order in lattice structures 
is partially solved by Langkilde and Knight (1998) by introducing an external recasting 
mechanism that preprocesses the input conceptual structure. This has the overall effect 
that phrases normally represented by two independent sublattices can now be generated 
one embedded into the other, therefore partially mimicking the interleaving of the words in 
the two phrases. However, this is not enough to treat free word order in its full generality. 

A third representation of finite languages, often found in the literature on compression 
theory (Nevill-Manning & Witten, 1997), is the class of non-recursive CFGs. A CFG
is called non-recursive if no nonterminal can be rewritten into a string containing the 
nonterminal itself. It is not difficult to see that such grammars can only generate finite 
languages. Non-recursive CFGs have recently been exploited in hybrid systems (Langkilde, 
2000).¹ This representation inherits all the expressivity of word lattices, and thus can
encode precedence constraints as well as disjunctions. In addition, non-recursive CFGs can 
achieve much smaller encodings of finite languages than word lattices. This is done by 
uniquely encoding certain sets of substrings that occur repeatedly through a nonterminal 
that can be reused in several places. This feature turns out to be very useful for natural 
language applications, as shown by experimental results reported by Langkilde (2000). 
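
The non-recursiveness condition can be checked mechanically. The following is a minimal sketch, assuming a hypothetical encoding of a CFG as a dictionary from nonterminals to lists of right-hand sides; it is an illustration only and is not taken from any of the cited systems. A grammar is non-recursive exactly when the graph linking each nonterminal to the nonterminals occurring in its right-hand sides has no cycle.

```python
def is_non_recursive(productions):
    """Return True if no nonterminal can derive a string containing itself.

    `productions` maps each nonterminal to a list of right-hand sides,
    each right-hand side being a tuple of symbols (an assumed encoding).
    """
    # Edge A -> B whenever nonterminal B occurs in a right-hand side of A.
    graph = {
        lhs: {sym for rhs in rhss for sym in rhs if sym in productions}
        for lhs, rhss in productions.items()
    }
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    colour = dict.fromkeys(graph, WHITE)

    def reaches_cycle(a):
        colour[a] = GREY
        for b in graph[a]:
            if colour[b] == GREY:
                return True  # back edge: b can be rewritten into a string containing b
            if colour[b] == WHITE and reaches_cycle(b):
                return True
        colour[a] = BLACK
        return False

    return not any(colour[a] == WHITE and reaches_cycle(a) for a in graph)


# Nonterminal A is reused in two positions, so four strings are encoded compactly.
finite = {"S": [("A", "A")], "A": [("a",), ("b",)]}
# S can be rewritten into a string containing S, so the language is infinite.
recursive = {"S": [("a", "S"), ("a",)]}

print(is_non_recursive(finite))     # True
print(is_non_recursive(recursive))  # False
```

The first toy grammar also illustrates the compression effect discussed above: the single nonterminal A stands for a set of substrings that is reused in two places.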

Although non-recursive CFGs can be more compact representations than word lattices, 
this representation still lacks a primitive for representing free word order. In fact, a CFG 
generating the finite language of all permutations of n symbols must have size at least 
exponential in n.² In addition, the problem of deciding whether some string encoded by
a non-recursive CFG can be parsed by a general CFG is PSPACE-complete (Nederhof &
Satta, 2004).

From the above discussion, one can draw the following conclusions. In considering the 
range of possible encodings for finite languages, we are interested in measuring (i) the com- 
pactness of the representation, and (ii) the efficiency of parsing the obtained representation 
by means of a CFG. At one extreme we have the naive solution of enumerating all strings 
in the language, and then independently parsing each individual string using a traditional 
string parsing algorithm. This solution is obviously unfeasible, since no compression at all 
is achieved and so the overall amount of time required might be exponential in the size of 

1. Langkilde (2000) uses the term "forests" for non-recursive CFGs, which is a different name for the same 
concept (Billot & Lang, 1989). 

2. An unpublished proof of this fact has been personally communicated to the authors by Jeffrey Shallit 
and Ming-wei Wang. 






Nederhof & Satta 



the input conceptual structure. Although word lattices are a more compact representation, 
when free word order needs to be encoded we may still have representations of exponential 
size as input to the parser, as already discussed. At the opposite extreme, we have solutions 
like bags of words or non-recursive CFGs, which allow very compact representations, but 
are still very demanding in parsing time requirements. Intuitively, this can be explained 
by considering that parsing a highly compressed finite language requires additional book- 
keeping with respect to the string case. What we then need to explore is some trade-off 
between these solutions, offering interesting compression factors at the expense of parsing 
time requirements that are provably polynomial in the cases of interest. As we will show in 
the sequel of this paper, IDL-expressions have these required properties and are therefore 
an interesting solution to the problem. 

3. Notation 

In this section we briefly recall some basic notions from formal language theory. For more 
details we refer the reader to standard textbooks (e.g., Harrison, 1978). 

For a set $A$, $|A|$ denotes the number of elements in $A$; for a string $x$ over some alphabet,
$|x|$ denotes the length of $x$. For string $x$ and languages (sets of strings) $L$ and $L'$, we let
$x \cdot L = \{xy \mid y \in L\}$ and $L \cdot L' = \{xy \mid x \in L,\ y \in L'\}$. We remind the reader that
a string-valued function $f$ over some alphabet $\Sigma$ can be extended to a homomorphism
over $\Sigma^*$ by letting $f(\varepsilon) = \varepsilon$ and $f(ax) = f(a)f(x)$ for $a \in \Sigma$ and $x \in \Sigma^*$. We also let
$f(L) = \{f(x) \mid x \in L\}$.

We denote a context-free grammar (CFG) by a 4-tuple $G = (N, \Sigma, P, S)$, where $N$ is a
finite set of nonterminals, $\Sigma$ is a finite set of terminals, with $\Sigma \cap N = \emptyset$, $S \in N$ is a special
symbol called the start symbol, and $P$ is a finite set of productions having the form $A \rightarrow \gamma$,
with $A \in N$ and $\gamma \in (\Sigma \cup N)^*$. Throughout the paper we assume the following conventions:
$A, B, C$ denote nonterminals, $a, b, c$ denote terminals, $\alpha, \beta, \gamma, \delta$ denote strings in $(\Sigma \cup N)^*$
and $x, y, z$ denote strings in $\Sigma^*$.

The derives relation is denoted $\Rightarrow_G$ and its transitive closure $\Rightarrow_G^+$. The language generated
by grammar $G$ is denoted $L(G)$. The size of $G$ is defined as

$$|G| = \sum_{(A \rightarrow \alpha) \in P} |A\alpha|. \qquad (1)$$
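
As a small illustration of equation (1), the sketch below computes the size of a toy grammar. The list-of-pairs encoding of productions and the grammar itself are assumptions made for this example only.

```python
def grammar_size(productions):
    """Sum of |A alpha| over all productions A -> alpha, as in equation (1)."""
    # Each production contributes 1 for its left-hand side plus the length
    # of its right-hand side.
    return sum(1 + len(rhs) for _lhs, rhs in productions)


toy = [
    ("S", ("NP", "VP")),        # contributes 3
    ("NP", ("we",)),            # contributes 2
    ("VP", ("play", "piano")),  # contributes 3
]
print(grammar_size(toy))  # 8
```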

4. IDL-Expressions 

In this section we introduce the class of IDL-expressions and define a mapping from such 
expressions to sets of strings. Similarly to regular expressions, IDL-expressions generate sets 
of strings, i.e., languages. However, these languages are always finite. Therefore the class of 
languages generated by IDL-expressions is a proper subset of the class of regular languages. 
As already discussed in the introduction, IDL-expressions combine language operators that 
were only considered in isolation in previous representations of finite languages exploited 
in surface natural language generation. In addition, some of these operations have been 
recently used in the discontinuous parsing literature, for the syntactic description of (infi- 
nite) languages with weak linear precedence constraints. IDL-expressions represent choices 
among words or phrases and their relative ordering by means of the standard concatenation 
operator '$\cdot$' from regular expressions, along with three additional operators to be discussed
in what follows. All these operators take as arguments one or more IDL-expressions, and 
combine the strings generated by these arguments in different ways. 

• Operator '$\|$', called interleave, interleaves strings resulting from its argument
expressions. A string $z$ results from the interleaving of two strings $x$ and $y$ whenever $z$
is composed of all and only the occurrences of symbols in $x$ and $y$, and these symbols
appear within $z$ in the same relative order as within $x$ and $y$. As an example,
consider strings abcd and efg. By interleaving these two strings we obtain, among many
others, the strings abecfgd, eabfgcd and efabcdg. In the formal language literature,
this operation has also been called 'shuffle', as for instance by Dassow and Paun
(1989). In the discontinuous parsing literature and in the literature on head-driven
phrase-structure grammars (HPSG; Pollard & Sag, 1994) the interleave operation is
also called 'sequence union' (Reape, 1989) or 'domain union' (Reape, 1994). The
interleave operator also occurs in an XML tool described by van der Vlist (2003).

• Operator '$\vee$', called disjunction, allows a choice between strings resulting from its
argument expressions. This is a standard operator from regular expressions, where it
is more commonly written as '+'.

• Operator '$\times$', called lock, takes a single IDL-expression as argument. This operator
states that no additional material can be interleaved with a string resulting from its
argument. The lock operator has been previously used in the discontinuous parsing
literature, as for instance by Daniels and Meurers (2002), Gotz and Penn (1997),
Ramsay (1999), Suhre (1999). In that context, the operator was called 'isolation'.

The interleave, disjunction and lock operators will also be called I, D and L operators, 
respectively. As we will see later, the combination of the I and L operators within IDL- 
expressions provides much of the power of existing formalisms to represent free word order, 
while maintaining computational properties quite close to those of regular expressions or 
finite automata. 
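
The interleaving of two plain strings, as described for the I operator above, can be sketched in a few lines. The function name `interleave` is a hypothetical helper introduced here for illustration; it ignores the lock operator, which is treated formally later in this section.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def interleave(x: str, y: str) -> frozenset:
    """Return all strings obtained by interleaving (shuffling) x and y."""
    if not x:
        return frozenset({y})
    if not y:
        return frozenset({x})
    # A result starts either with the first symbol of x or with that of y;
    # the relative order inside x and inside y is preserved by the recursion.
    return frozenset(
        {x[0] + z for z in interleave(x[1:], y)}
        | {y[0] + z for z in interleave(x, y[1:])}
    )


results = interleave("abcd", "efg")
print(len(results))                                   # 35
print({"abecfgd", "eabfgcd", "efabcdg"} <= results)   # True
```

Since all seven symbols are distinct, the number of results equals the number of ways of choosing the four positions occupied by abcd, that is, 35.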

As an introductory example, we discuss the following IDL-expression, defined over the 
word alphabet {piano, play, must, necessarily, we}. 



$$\|(\vee(\mathit{necessarily}, \mathit{must}),\ \mathit{we} \cdot \times(\mathit{play} \cdot \mathit{piano})). \qquad (2)$$

IDL-expression (2) says that words we, play and piano must appear in that order in any 
of the generated strings, as specified by the two occurrences of the concatenation operator. 
Furthermore, the use of the lock operator states that no additional words can ever appear 
in between play and piano. The disjunction operator expresses the choice between words 
necessarily and must. Finally, the interleave operator states that the word resulting from 
the first of its arguments must be inserted into the sequence we, play, piano, in any of the 
available positions. Notice the interaction with the lock operator, which, as we have seen, 
makes unavailable the position in between play and piano. Thus the following sentences, 
among others, can be generated by IDL-expression (2): 






necessarily we play piano 

must we play piano 

we must play piano 

we play piano necessarily. 

However, the following sentences cannot be generated by IDL-expression (2): 

we play necessarily piano 
necessarily must we play piano. 

The first sentence is disallowed through the use of the lock operator, and the second sentence 
is impossible because the disjunction operator states that exactly one of the arguments must 
appear in the sentence realization. We now provide a formal definition of the class of IDL- 
expressions. 

Definition 1 Let $\Sigma$ be some finite alphabet and let $E$ be a symbol not in $\Sigma$. An
IDL-expression over $\Sigma$ is a string $\pi$ satisfying one of the following conditions:

(i) $\pi = a$, with $a \in \Sigma \cup \{E\}$;

(ii) $\pi = \times(\pi')$, with $\pi'$ an IDL-expression;

(iii) $\pi = \vee(\pi_1, \pi_2, \ldots, \pi_n)$, with $n \geq 2$ and $\pi_i$ an IDL-expression for each $i$, $1 \leq i \leq n$;

(iv) $\pi = \|(\pi_1, \pi_2, \ldots, \pi_n)$, with $n \geq 2$ and $\pi_i$ an IDL-expression for each $i$, $1 \leq i \leq n$;

(v) $\pi = \pi_1 \cdot \pi_2$, with $\pi_1$ and $\pi_2$ both IDL-expressions.

We take the infix operator '$\cdot$' to be right associative, although in all of the definitions in
this paper, disambiguation of associativity is not relevant and can be taken arbitrarily. We
say that IDL-expression $\pi'$ is a subexpression of $\pi$ if $\pi'$ appears as an argument of some
operator in $\pi$.

We now develop a precise semantics for IDL-expressions. The only technical difficulty 
in doing so arises with the proper treatment of the lock operator.³ Let $x$ be a string over
$\Sigma$. The basic idea below is to use a new symbol $\diamond$, not already in $\Sigma$. An occurrence of $\diamond$
between two terminals indicates that an additional string can be inserted at that position.
As an example, if $x = x' \diamond x''x'''$ with $x'$, $x''$ and $x'''$ strings over $\Sigma$, and if we need to
interleave $x$ with a string $y$, then we may get as a result string $x'yx''x'''$ but not string
$x' \diamond x''yx'''$. The lock operator corresponds to the removal of every occurrence of $\diamond$ from a
string.

More precisely, strings in $(\Sigma \cup \{\diamond\})^*$ will be used to represent sequences of strings over
$\Sigma$; symbol $\diamond$ is used to separate the strings in the sequence. Furthermore, we introduce a
string homomorphism lock over $(\Sigma \cup \{\diamond\})^*$ by letting $\mathrm{lock}(a) = a$ for $a \in \Sigma$ and $\mathrm{lock}(\diamond) = \varepsilon$.
An application of lock to an input sequence can be seen as the operation of concatenating
together all of the strings in the sequence.



3. If we were to add the Kleene star, then infinite languages can be specified, and interleave and lock can be 
more conveniently defined using derivatives (Brzozowski, 1964), as noted before by van der Vlist (2003). 
We can now define the basic operation comb, which plays an important role in the sequel.
This operation composes two sequences $x$ and $y$ of strings, represented as explained above,
into a set of new sequences of strings. This is done by interleaving the two input sequences
in every possible way. Operation comb makes use of an auxiliary operation $\mathrm{comb}'$, which
also constructs interleaved sequences out of input sequences $x$ and $y$, but always starting
with the first string in its first argument $x$. As any sequence in $\mathrm{comb}(x, y)$ must start with a
string from $x$ or with a string from $y$, $\mathrm{comb}(x, y)$ is the union of $\mathrm{comb}'(x, y)$ and $\mathrm{comb}'(y, x)$.
In the definition of $\mathrm{comb}'$, we distinguish the case in which $x$ consists of a single string and
the case in which $x$ consists of at least two strings. In the latter case, the tail of an output
sequence can be obtained by applying comb recursively on the tail of sequence $x$ and the
complete sequence $y$. For $x, y \in (\Sigma \cup \{\diamond\})^*$, we have:

$$\mathrm{comb}(x, y) = \mathrm{comb}'(x, y) \cup \mathrm{comb}'(y, x)$$

$$\mathrm{comb}'(x, y) = \begin{cases} \{x \diamond y\}, & \text{if } x \in \Sigma^*; \\ \{x' \diamond\} \cdot \mathrm{comb}(x'', y), & \text{if there are } x' \in \Sigma^* \text{ and } x'' \text{ such that } x = x' \diamond x''. \end{cases}$$



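As an example, let $\Sigma = \{a, b, c, d, e\}$ and consider the two sequences $a \diamond bb \diamond c$ and $d \diamond e$.
Then we have

$$\begin{aligned}
\mathrm{comb}(a \diamond bb \diamond c,\ d \diamond e) = \{\, & a \diamond bb \diamond c \diamond d \diamond e,\ a \diamond bb \diamond d \diamond c \diamond e,\ a \diamond bb \diamond d \diamond e \diamond c, \\
& a \diamond d \diamond bb \diamond c \diamond e,\ a \diamond d \diamond bb \diamond e \diamond c,\ a \diamond d \diamond e \diamond bb \diamond c, \\
& d \diamond a \diamond bb \diamond c \diamond e,\ d \diamond a \diamond bb \diamond e \diamond c,\ d \diamond a \diamond e \diamond bb \diamond c, \\
& d \diamond e \diamond a \diamond bb \diamond c \,\}.
\end{aligned}$$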
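
The definition of comb and the example above can be reproduced with the following sketch. It assumes that a sequence is encoded as a Python string in which the character U+22C4 plays the role of the separator symbol; the names `SEP`, `comb` and `comb_prime` are introduced here for illustration only.

```python
SEP = "\u22c4"  # separator symbol, assumed not to occur in the alphabet


def comb(x: str, y: str) -> set:
    """All interleavings of the two sequences x and y."""
    return comb_prime(x, y) | comb_prime(y, x)


def comb_prime(x: str, y: str) -> set:
    """Interleavings of x and y that start with the first string of x."""
    if SEP not in x:
        # x consists of a single string: it is followed by the whole of y.
        return {x + SEP + y}
    head, tail = x.split(SEP, 1)  # x = head SEP tail
    return {head + SEP + z for z in comb(tail, y)}


x = SEP.join(["a", "bb", "c"])
y = SEP.join(["d", "e"])
for sequence in sorted(comb(x, y)):
    print(sequence)
print(len(comb(x, y)))  # 10
```

The ten sequences printed are those listed in the example: six start with a and four start with d.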

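For languages $L_1, L_2$ we define $\mathrm{comb}(L_1, L_2) = \bigcup_{x \in L_1,\, y \in L_2} \mathrm{comb}(x, y)$. More generally, for
languages $L_1, L_2, \ldots, L_d$, $d \geq 2$, we define $\mathrm{comb}_{i=1}^{d} L_i = \mathrm{comb}(L_1, L_2)$ for $d = 2$, and
$\mathrm{comb}_{i=1}^{d} L_i = \mathrm{comb}(\mathrm{comb}_{i=1}^{d-1} L_i, L_d)$ for $d > 2$.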

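Definition 2 Let $\Sigma$ be some finite alphabet. Let $\sigma$ be a function mapping IDL-expressions
over $\Sigma$ into subsets of $(\Sigma \cup \{\diamond\})^*$, specified by the following conditions:

(i) $\sigma(a) = \{a\}$ for $a \in \Sigma$, and $\sigma(E) = \{\varepsilon\}$;

(ii) $\sigma(\times(\pi)) = \mathrm{lock}(\sigma(\pi))$;

(iii) $\sigma(\vee(\pi_1, \pi_2, \ldots, \pi_n)) = \bigcup_{i=1}^{n} \sigma(\pi_i)$;

(iv) $\sigma(\|(\pi_1, \pi_2, \ldots, \pi_n)) = \mathrm{comb}_{i=1}^{n}\, \sigma(\pi_i)$;

(v) $\sigma(\pi \cdot \pi') = \sigma(\pi) \cdot \{\diamond\} \cdot \sigma(\pi')$.

The set of strings that satisfy an IDL-expression $\pi$, written $L(\pi)$, is given by $L(\pi) =
\mathrm{lock}(\sigma(\pi))$.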
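
The semantics of Definition 2 can be turned into a small evaluator. The sketch below is illustrative only and rests on several assumptions: IDL-expressions are encoded as nested tuples with the hypothetical tags `"sym"`, `"eps"`, `"lock"`, `"or"`, `"inter"` and `"cat"`; a sequence of strings is encoded as a tuple of segments rather than as a string with separators; and concatenation is taken to place a separator between its two arguments, which is what IDL-expression (2) requires for *necessarily* to be insertable between *we* and *play*.

```python
from functools import reduce
from itertools import product


def comb(x, y):
    return comb_prime(x, y) | comb_prime(y, x)


def comb_prime(x, y):
    if len(x) == 1:               # x is a single string
        return {x + y}
    return {x[:1] + z for z in comb(x[1:], y)}


def comb_sets(l1, l2):
    return {z for x, y in product(l1, l2) for z in comb(x, y)}


def lock(seq):
    # Concatenate all segments of the sequence into a single segment.
    return (tuple(w for seg in seq for w in seg),)


def sigma(e):
    op, args = e[0], e[1:]
    if op == "sym":
        return {((args[0],),)}
    if op == "eps":
        return {((),)}
    if op == "lock":
        return {lock(s) for s in sigma(args[0])}
    if op == "or":
        return set().union(*(sigma(a) for a in args))
    if op == "inter":
        return reduce(comb_sets, (sigma(a) for a in args))
    if op == "cat":
        # Tuple concatenation keeps a segment boundary between the arguments.
        return {x + y for x in sigma(args[0]) for y in sigma(args[1])}
    raise ValueError(f"unknown operator: {op}")


def language(e):
    return {" ".join(lock(s)[0]) for s in sigma(e)}


# IDL-expression (2)
expr2 = ("inter",
         ("or", ("sym", "necessarily"), ("sym", "must")),
         ("cat", ("sym", "we"),
                 ("lock", ("cat", ("sym", "play"), ("sym", "piano")))))

for sentence in sorted(language(expr2)):
    print(sentence)
```

Under these assumptions the evaluator yields six sentences for IDL-expression (2), including the four listed earlier, and excludes both *we play necessarily piano* and *necessarily must we play piano*.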
As an example for the above definition, we show how the interleave operator can be
used in an IDL-expression to denote the set of all strings realizing permutations of a given
bag of symbols. Let $\Sigma = \{a, b, c\}$. Consider a bag $(a, a, b, c, c)$ and IDL-expression

$$\|(a, a, b, c, c). \qquad (3)$$

By applying Definition 2 to IDL-expression (3), we obtain in the first few steps

$$\begin{aligned}
\sigma(a) &= \{a\}, \\
\sigma(b) &= \{b\}, \\
\sigma(c) &= \{c\}, \\
\sigma(\|(a, a)) &= \mathrm{comb}(\{a\}, \{a\}) = \{a \diamond a\}, \\
\sigma(\|(a, a, b)) &= \mathrm{comb}(\{a \diamond a\}, \{b\}) = \{b \diamond a \diamond a,\ a \diamond b \diamond a,\ a \diamond a \diamond b\}.
\end{aligned}$$

In the next step we obtain $3 \times 4$ sequences of length 4, each using all the symbols from
bag $(a, a, b, c)$. One more application of the comb operator, on this set and set $\{c\}$,
provides all possible sequences of singleton strings expressing permutations of symbols in bag
$(a, a, b, c, c)$. After removing symbol $\diamond$ throughout, which conceptually turns sequences of
strings into undivided strings, we obtain the desired language $L(\|(a, a, b, c, c))$ of
permutations of bag $(a, a, b, c, c)$.
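
The computation for IDL-expression (3) can be replayed step by step. This is a minimal sketch under the same assumed string encoding as before, with U+22C4 as the separator; the helper names are hypothetical. The result is compared against a direct enumeration of the permutations of the bag.

```python
from functools import reduce
from itertools import permutations

SEP = "\u22c4"  # separator symbol, assumed not to occur in the alphabet


def comb(x, y):
    return comb_prime(x, y) | comb_prime(y, x)


def comb_prime(x, y):
    if SEP not in x:
        return {x + SEP + y}
    head, tail = x.split(SEP, 1)
    return {head + SEP + z for z in comb(tail, y)}


def comb_languages(l1, l2):
    return {z for x in l1 for y in l2 for z in comb(x, y)}


def lock(x):
    # Removing every separator concatenates the strings of the sequence.
    return x.replace(SEP, "")


# sigma of each argument of ||(a, a, b, c, c), combined from left to right
arguments = [{"a"}, {"a"}, {"b"}, {"c"}, {"c"}]
sequences = reduce(comb_languages, arguments)
language = {lock(s) for s in sequences}

print(len(language))                                            # 30
print(language == {"".join(p) for p in permutations("aabcc")})  # True
```

The bag has five symbols with two repeated pairs, so the language contains 5!/(2! 2!) = 30 strings.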

To conclude this section, we compare the expressivity of IDL-expressions with that of 
the formalisms discussed in Section 2. We do this by means of a simple example. In 
what follows, we use the alphabet {NP, PP, V}. These symbols denote units standardly 
used in syntactic analysis of natural language, and stand for, respectively, noun phrase, 
prepositional phrase and verb. Symbols NP, PP and V should be rewritten into actual words 
of the language, but we use these as terminal symbols to simplify the presentation. Consider 
a language having the subject-verb-object (SVO) order and a sentence having the structure

$$[_{S}\ \mathrm{NP}_1\ \mathrm{V}\ \mathrm{NP}_2],$$

where NP$_1$ realizes the subject position and NP$_2$ realizes the object position. Let PP$_1$ and
PP$_2$ be phrases that must be inserted in the above sentence as modifiers. Assume that we
know that the language at hand does not allow modifiers to appear in between the verbal
and the object positions. Then we are left with 3 available positions for the realization
of a first modifier, out of the 4 positions in our string. After the first modifier is inserted
within the string, we have 5 positions, but only 4 are available for the realization of a second
modifier, because of our assumption. This results in a total of $3 \times 4 = 12$ possible sentence
realizations.

A bag of words for these sentences is unable to capture the above constraint on the 
positioning of modifiers. At the same time, a word lattice for these sentences would contain 
12 distinct paths, corresponding to the different realizations of the modifiers in the basic 
sentence. Using the IDL formalism, we can easily capture the desired realizations by means 
of the IDL-expression: 

$$\|(\mathrm{PP}_1,\ \mathrm{PP}_2,\ \mathrm{NP}_1 \cdot \times(\mathrm{V} \cdot \mathrm{NP}_2)).$$

Again, note the presence of the lock operator, which implements our restriction against
modifiers appearing in between the verbal and the object position, similarly to what we 
have done in IDL-expression (2). 

Consider now a sentence with a subordinate clause, having the structure 

$[_S\ NP_1\ V_1\ NP_2\ [_{S'}\ NP_3\ V_2\ NP_4]]$,

and assume that modifiers $PP_1$ and $PP_2$ apply to the main clause, while modifiers $PP_3$ and
$PP_4$ apply to the subordinate clause. As before, we have $3 \times 4$ possible realizations for the
subordinate sentence. If we allow main clause modifiers to appear in positions before the
subordinate clause as well as after the subordinate clause, we have $4 \times 5$ possible realizations
for the main sentence. Overall, this gives a total of $3 \times 4^2 \times 5 = 240$ possible sentence
realizations.

Again, a bag representation for these sentences is unable to capture the above restric- 
tions on word order, and would therefore badly overgenerate. Since the main sentence 
modifiers could be placed after the subordinate clause, we need to record for each of the 
two modifiers of the main clause whether it has already been seen, while processing the 12 
possible realizations of the subordinate clause. This increases the size of the representation 
by a factor of $2 \times 2 = 4$. On the other hand, the desired realizations can be easily captured
by means of the IDL-expression:

$\|(PP_1, PP_2, NP_1 \cdot \times(V_1 \cdot NP_2) \cdot \times(\|(PP_3, PP_4, NP_3 \cdot \times(V_2 \cdot NP_4))))$.

Note the use of embedded lock operators (the two rightmost occurrences). The rightmost 
and the leftmost occurrences of the lock operator implement our restriction against modi- 
fiers appearing in between the verbal and the object position. The occurrence of the lock 
operator in the middle of the IDL-expression prevents any of the modifiers $PP_1$ and $PP_2$
from modifying elements appearing within the subordinate clause. Observe that when we 
generalize the above examples by embedding n subordinate clauses, the corresponding word 
lattice will grow exponentially in n, while the IDL-expression has linear size in n. 
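
Both counts can be reproduced with a small evaluator. The sketch below is not the formal semantics of IDL-expressions given earlier in the paper; it assumes a simplified model in which a string is a tuple of indivisible chunks, the lock operator fuses a string into a single chunk, and interleaving shuffles chunks without splitting them. All function names are hypothetical.

```python
def atom(a):
    # Language with one string, made of one single-symbol chunk.
    return {((a,),)}

def concat(*langs):
    result = {()}
    for lang in langs:
        result = {x + y for x in result for y in lang}
    return result

def disj(*langs):
    return set().union(*langs)

def lock(lang):
    # Fuse every string into a single chunk that cannot be split later.
    return {(tuple(s for chunk in x for s in chunk),) for x in lang}

def shuffle(x, y):
    # All interleavings of two chunk sequences, preserving internal order.
    if not x:
        return {y}
    if not y:
        return {x}
    return ({(x[0],) + r for r in shuffle(x[1:], y)}
            | {(y[0],) + r for r in shuffle(x, y[1:])})

def interleave(*langs):
    result = {()}
    for lang in langs:
        result = {z for x in result for y in lang for z in shuffle(x, y)}
    return result

def flatten(lang):
    # Forget chunk boundaries and return plain strings (tuples of symbols).
    return {tuple(s for chunk in x for s in chunk) for x in lang}

simple = interleave(
    atom("PP1"), atom("PP2"),
    concat(atom("NP1"), lock(concat(atom("V"), atom("NP2")))))

subordinate = lock(interleave(
    atom("PP3"), atom("PP4"),
    concat(atom("NP3"), lock(concat(atom("V2"), atom("NP4"))))))
embedded = interleave(
    atom("PP1"), atom("PP2"),
    concat(atom("NP1"), lock(concat(atom("V1"), atom("NP2"))), subordinate))

print(len(flatten(simple)), len(flatten(embedded)))  # 12 240
```

The expression objects stay small (one call per operator occurrence), whereas the sets they denote grow quickly; this is the compactness argument made in the text.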

5. IDL-Graphs 

Although IDL-expressions may be easily composed by linguists, they do not allow a direct 
algorithmic interpretation for efficient recognition of strings. We therefore define an equiva- 
lent but lower-level representation for IDL-expressions, which we call IDL-graphs. For this 
purpose, we exploit a specific kind of edge-labelled acyclic graphs with ranked nodes. We 
first introduce our notation, and then define the encoding function from IDL-expressions to 
IDL-graphs. 

The graphs we use are denoted by tuples $(V, E, v_s, v_e, \lambda, r)$, where:

• $V$ and $E$ are finite sets of vertices and edges, respectively;

• $v_s$ and $v_e$ are special vertices in $V$ called the start and the end vertices, respectively;

• $\lambda$ is the edge-labelling function, mapping $E$ into the alphabet $\Sigma \cup \{\varepsilon, \vdash, \dashv\}$;

• $r$ is the vertex-ranking function, mapping $V$ to $\mathbb{N}$, the set of non-negative integer
numbers.

Label $\varepsilon$ indicates that an edge does not consume any input symbols. Edge labels $\vdash$ and $\dashv$
have the same meaning, but they additionally encode that we are at the start or end,
respectively, of what corresponds to an I operator. More precisely, let $\pi$ be an IDL-expression
headed by an occurrence of the I operator and let $\gamma(\pi)$ be the associated IDL-graph. We
use edges labelled by $\vdash$ to connect the start vertex of $\gamma(\pi)$ with the start vertices of all the
subgraphs encoding the arguments of I. Similarly, we use edges labelled by $\dashv$ to connect
all the end vertices of the subgraphs encoding the arguments of I with the end vertex of
$\gamma(\pi)$. Edge labels $\vdash$ and $\dashv$ are needed in the next section to distinguish occurrences of the
I operator from occurrences of the D and L operators. Finally, the function $r$ ranks each
vertex according to how deeply it is embedded into (the encoding of) expressions headed
by an occurrence of the L operator. As we will see later, this information is necessary for
processing "locked" vertices with the correct priority.

We can now map an IDL-expression into the corresponding IDL-graph. 

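Definition 3 Let $\Sigma$ be some finite alphabet, and let $j$ be a non-negative integer number.
Each IDL-expression $\pi$ over $\Sigma$ is associated with some graph $\gamma_j(\pi) = (V, E, v_s, v_e, \lambda, r)$
specified as follows:

(i) if $\pi = a$, $a \in \Sigma \cup \{\varepsilon\}$, let $v_s, v_e$ be new nodes; we have

(a) $V = \{v_s, v_e\}$,

(b) $E = \{(v_s, v_e)\}$,

(c) $\lambda((v_s, v_e)) = a$ for $a \in \Sigma$ and $\lambda((v_s, v_e)) = \varepsilon$ for $a = \varepsilon$,

(d) $r(v_s) = r(v_e) = j$;

(ii) if $\pi = \times(\pi')$ with $\gamma_{j+1}(\pi') = (V', E', v'_s, v'_e, \lambda', r')$, let $v_s, v_e$ be new nodes; we have

(a) $V = V' \cup \{v_s, v_e\}$,

(b) $E = E' \cup \{(v_s, v'_s), (v'_e, v_e)\}$,

(c) $\lambda(e) = \lambda'(e)$ for $e \in E'$, $\lambda((v_s, v'_s)) = \lambda((v'_e, v_e)) = \varepsilon$,

(d) $r(v) = r'(v)$ for $v \in V'$, $r(v_s) = r(v_e) = j$;

(iii) if $\pi = \vee(\pi_1, \pi_2, \ldots, \pi_n)$ with $\gamma_j(\pi_i) = (V_i, E_i, v_{i,s}, v_{i,e}, \lambda_i, r_i)$, $1 \le i \le n$, let $v_s, v_e$ be
new nodes; we have

(a) $V = \bigcup_{i=1}^{n} V_i \cup \{v_s, v_e\}$,

(b) $E = \bigcup_{i=1}^{n} E_i \cup \{(v_s, v_{i,s}) \mid 1 \le i \le n\} \cup \{(v_{i,e}, v_e) \mid 1 \le i \le n\}$,

(c) $\lambda(e) = \lambda_i(e)$ for $e \in E_i$, $\lambda((v_s, v_{i,s})) = \lambda((v_{i,e}, v_e)) = \varepsilon$ for $1 \le i \le n$,

(d) $r(v) = r_i(v)$ for $v \in V_i$, $r(v_s) = r(v_e) = j$;

(iv) if $\pi = \|(\pi_1, \pi_2, \ldots, \pi_n)$ with $\gamma_j(\pi_i) = (V_i, E_i, v_{i,s}, v_{i,e}, \lambda_i, r_i)$, $1 \le i \le n$, let $v_s, v_e$ be
new nodes; we have

(a) $V = \bigcup_{i=1}^{n} V_i \cup \{v_s, v_e\}$,

(b) $E = \bigcup_{i=1}^{n} E_i \cup \{(v_s, v_{i,s}) \mid 1 \le i \le n\} \cup \{(v_{i,e}, v_e) \mid 1 \le i \le n\}$,
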
Figure 1: The IDL-graph associated with the IDL-expression
$\|(\vee(\text{necessarily}, \text{must}), \text{we} \cdot \times(\text{play} \cdot \text{piano}))$.



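(c) $\lambda(e) = \lambda_i(e)$ for $e \in E_i$, $\lambda((v_s, v_{i,s})) = {\vdash}$ and $\lambda((v_{i,e}, v_e)) = {\dashv}$ for $1 \le i \le n$,

(d) $r(v) = r_i(v)$ for $v \in V_i$, $r(v_s) = r(v_e) = j$;

(v) if $\pi = \pi_1 \cdot \pi_2$ with $\gamma_j(\pi_i) = (V_i, E_i, v_{i,s}, v_{i,e}, \lambda_i, r_i)$, $i \in \{1, 2\}$, let $v_s = v_{1,s}$ and $v_e = v_{2,e}$;
we have

(a) $V = V_1 \cup V_2$,

(b) $E = E_1 \cup E_2 \cup \{(v_{1,e}, v_{2,s})\}$,

(c) $\lambda(e) = \lambda_i(e)$ for $e \in E_i$, for $i \in \{1, 2\}$, $\lambda((v_{1,e}, v_{2,s})) = \varepsilon$,

(d) $r(v) = r_i(v)$ for $v \in V_i$, $i \in \{1, 2\}$.

We let $\gamma(\pi) = \gamma_0(\pi)$. An IDL-graph is a graph that has the form $\gamma(\pi)$ for some
IDL-expression $\pi$ over $\Sigma$.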
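
The encoding of Definition 3 can be sketched as a recursive construction. The code below is an illustration under stated assumptions, not the authors' implementation: expressions are nested tuples with the hypothetical tags `"atom"`, `"lock"`, `"or"`, `"ileave"` and `"cat"`, vertices are consecutive integers, and the strings `"eps"`, `"|-"` and `"-|"` stand in for the three special edge labels.

```python
from itertools import count

EPS, FORK, JOIN = "eps", "|-", "-|"   # stand-ins for the three special labels

class IDLGraph:
    def __init__(self):
        self.out = {}     # vertex -> ordered list of (label, target)
        self.rank = {}    # vertex -> rank r(v)
        self._fresh = count()

    def new_vertex(self, j):
        v = next(self._fresh)
        self.out[v], self.rank[v] = [], j
        return v

    def add_edge(self, u, label, v):
        self.out[u].append((label, v))

def build(expr, g, j=0):
    """Add gamma_j(expr) to g and return its (start, end) vertices."""
    kind, args = expr[0], expr[1:]
    if kind == "atom":                       # case (i)
        vs, ve = g.new_vertex(j), g.new_vertex(j)
        g.add_edge(vs, args[0], ve)
        return vs, ve
    if kind == "cat":                        # case (v): no new vertices
        s1, e1 = build(args[0], g, j)
        s2, e2 = build(args[1], g, j)
        g.add_edge(e1, EPS, s2)
        return s1, e2
    # Cases (ii)-(iv): fresh start and end vertices around the arguments.
    opening, closing = (FORK, JOIN) if kind == "ileave" else (EPS, EPS)
    inner_rank = j + 1 if kind == "lock" else j
    vs = g.new_vertex(j)
    ends = []
    for sub in args:
        s, e = build(sub, g, inner_rank)
        g.add_edge(vs, opening, s)
        ends.append(e)
    ve = g.new_vertex(j)
    for e in ends:
        g.add_edge(e, closing, ve)
    return vs, ve

# The expression of Figure 1.
expr2 = ("ileave",
         ("or", ("atom", "necessarily"), ("atom", "must")),
         ("cat", ("atom", "we"),
                 ("lock", ("cat", ("atom", "play"), ("atom", "piano")))))
g = IDLGraph()
v_s, v_e = build(expr2, g)
print(len(g.rank), sum(len(es) for es in g.out.values()))  # 16 vertices, 17 edges
```

The vertex numbering produced by this sketch need not coincide with the one used in Figure 1; only the shape of the graph and the ranks are determined by the definition.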

Figure 1 presents the IDL-graph $\gamma(\pi)$, where $\pi$ is IDL-expression (2).

We now introduce the important notion of cut of an IDL-graph. This notion is needed 
to define the language described by an IDL-graph, so that we can talk about equivalence 
between IDL-expressions and IDL-graphs. At the same time, this notion will play a crucial 
role in the specification of our parsing algorithm for IDL-graphs in the next section. Let 
us fix some IDL-expression $\pi$ and let $\gamma(\pi) = (V, E, v_s, v_e, \lambda, r)$ be the associated
IDL-graph. Intuitively speaking, a cut through $\gamma(\pi)$ is a set of vertices that we might reach
simultaneously when traversing $\gamma(\pi)$ from the start vertex to the end vertex, following the
different branches as prescribed by the encoded I, D and L operators, and in an attempt to
produce a string of $L(\pi)$.

In what follows we view $V$ as a finite alphabet, and we define the set $\hat{V}$ to contain those
strings over $V$ in which each symbol occurs at most once. Therefore $\hat{V}$ is a finite set and
for each string $c \in \hat{V}$ we have $|c| \le |V|$. If we assume that the outgoing edges of each vertex
in an IDL-graph are linearly ordered, we can represent cuts in a canonical way by means of
strings in $\hat{V}$ as defined below.

Let $r$ be the ranking function associated with $\gamma(\pi)$. We write $c[v_1 \cdots v_m]$ to denote a
string $c \in \hat{V}$ satisfying the following conditions:

• $c$ has the form $x v_1 \cdots v_m y$ with $x, y \in \hat{V}$ and $v_i \in V$ for $1 \le i \le m$; and

• for each vertex $v$ within $c$ and for each $i$, $1 \le i \le m$, we have $r(v) \le r(v_i)$.

In words, $c[v_1 \cdots v_m]$ indicates that vertices $v_1, \ldots, v_m$ occur adjacent in $c$ and have the
maximal rank among all vertices within string $c$. Let $c[v_1 \cdots v_m] = x v_1 \cdots v_m y$ be a string
defined as above and let $v'_1 \cdots v'_{m'} \in \hat{V}$ be a second string such that no symbol $v'_i$, $1 \le i \le m'$,
appears in $x$ or $y$. We write $c[v_1 \cdots v_m := v'_1 \cdots v'_{m'}]$ to denote the string $x v'_1 \cdots v'_{m'} y \in \hat{V}$.

The reason we distinguish the vertices with maximal rank from those with lower rank is 
that the former correspond with subexpressions that are nested deeper within subexpres- 
sions headed by the L operator. As a substring originating within the scope of an occurrence 
of the lock operator cannot be interleaved with symbols originating outside that scope, we 
should terminate the processing of all vertices with higher rank before resuming processing 
of those with lower rank. 

We now define a relation that plays a crucial role in the definition of the notion of cut, 
as well as in the specification of our parsing algorithm. 

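Definition 4 Let $\Sigma$ be some finite alphabet, let $\pi$ be an IDL-expression over $\Sigma$, and let
$\gamma(\pi) = (V, E, v_s, v_e, \lambda, r)$ be its associated IDL-graph. The relation $\Delta_{\gamma(\pi)} \subseteq \hat{V} \times (\Sigma \cup \{\varepsilon\}) \times \hat{V}$
is the smallest relation satisfying all of the following conditions:

(i) for each $c[v] \in \hat{V}$ and $(v, v') \in E$ with $\lambda((v, v')) = X \in \Sigma \cup \{\varepsilon\}$, we have

$$(c[v],\ X,\ c[v := v']) \in \Delta_{\gamma(\pi)}; \qquad (4)$$

(ii) for each $c[v] \in \hat{V}$ with the outgoing edges of $v$ being exactly $(v, v_1), \ldots, (v, v_n) \in E$, in
this order, and with $\lambda((v, v_i)) = {\vdash}$, $1 \le i \le n$, we have

$$(c[v],\ \varepsilon,\ c[v := v_1 \cdots v_n]) \in \Delta_{\gamma(\pi)}; \qquad (5)$$

(iii) for each $c[v_1 \cdots v_n] \in \hat{V}$ with the incoming edges of some $v \in V$ being exactly
$(v_1, v), \ldots, (v_n, v) \in E$, in this order, and with $\lambda((v_i, v)) = {\dashv}$, $1 \le i \le n$, we have

$$(c[v_1 \cdots v_n],\ \varepsilon,\ c[v_1 \cdots v_n := v]) \in \Delta_{\gamma(\pi)}. \qquad (6)$$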

Henceforth, we will abuse notation by writing $\Delta_\pi$ in place of $\Delta_{\gamma(\pi)}$. Intuitively speaking,
relation $\Delta_\pi$ will be used to simulate a one-step move over IDL-graph $\gamma(\pi)$. Condition (4)
refers to moves that follow a single edge in the graph, labelled by a symbol from the alphabet 
or by the empty string. This move is exploited, e.g., upon visiting a vertex at the start of 
a subgraph that encodes an IDL-expression headed by an occurrence of the D operator. In 
this case, each outgoing edge represents a possible next move, but at most one edge can be 
chosen. Condition (5) refers to moves that simultaneously follow all edges emanating from 
the vertex at hand. This is used when processing a vertex at the start of a subgraph that 
encodes an IDL-expression headed by an occurrence of the I operator. In fact, in accordance 
with the given semantics, all possible argument expressions must be evaluated in parallel
by a single computation. Finally, Condition (6) refers to a move that can be read as the 
complement of the previous type of move. 

Examples of elements in $\Delta_\pi$ in the case of Figure 1 are $(v_s, \varepsilon, v_0 v_6)$ following
Condition (5) and $(v_5 v_{13}, \varepsilon, v_e)$ following Condition (6), which start and end the
evaluation of the occurrence of the I operator. Other elements are $(v_0 v_6, \varepsilon, v_1 v_6)$,
$(v_1 v_9, \text{play}, v_1 v_{10})$ and $(v_1 v_{13}, \text{necessarily}, v_2 v_{13})$ following Condition (4). Note that, e.g.,
$(v_1 v_{10}, \text{necessarily}, v_2 v_{10})$ is not an element of $\Delta_\pi$, as $v_{10}$ has higher rank than $v_1$.

We are now ready to define the notion of cut. 

Definition 5 Let $\Sigma$ be some finite alphabet, let $\pi$ be an IDL-expression over $\Sigma$, and let
$\gamma(\pi) = (V, E, v_s, v_e, \lambda, r)$ be its associated IDL-graph. The set of all cuts of $\gamma(\pi)$, written
cut($\gamma(\pi)$), is the smallest subset of $\hat{V}$ satisfying the following conditions:

(i) string $v_s$ belongs to cut($\gamma(\pi)$);

(ii) for each $c \in$ cut($\gamma(\pi)$) and $(c, X, c') \in \Delta_\pi$, string $c'$ belongs to cut($\gamma(\pi)$).

Henceforth, we will abuse notation by writing cut($\pi$) for cut($\gamma(\pi)$). As already remarked,
we can interpret a cut $v_1 v_2 \cdots v_k \in$ cut($\pi$), $v_i \in V$ for $1 \le i \le k$, as follows. In the
attempt to generate a string in $L(\pi)$, we traverse several paths of the IDL-graph $\gamma(\pi)$. This
corresponds to the "parallel" evaluation of some of the subexpressions of $\pi$, and each $v_i$ in
$v_1 v_2 \cdots v_k$ refers to one such subexpression. Thus, $k$ provides the number of evaluations that
we are carrying out in parallel at the point of the computation represented by the cut. Note
however that, when drawing a straight line across a planar representation of an IDL-graph,
separating the start vertex from the end vertex, the set of vertices that we can identify is
not necessarily a cut.$^4$ In fact, as we have already explained when discussing relation $\Delta_\pi$,
only one path is followed at the start of a subgraph that encodes an IDL-expression headed 
by an occurrence of the D operator. Furthermore, even if several arcs are to be followed at 
the start of a subgraph that encodes an IDL-expression headed by an occurrence of the I 
operator, some combinations of vertices will not satisfy the definition of cut when there are 
L operators within those argument expressions. These observations will be more precisely 
addressed in Section 7, where we will provide a mathematical analysis of the complexity of 
our algorithm. 

Examples of cuts in the case of Figure 1 are $v_s$, $v_e$, $v_0 v_6$, $v_1 v_6$, $v_3 v_6$, $v_0 v_7$, etc. Strings
such as $v_1 v_3$ are not cuts, as $v_1$ and $v_3$ belong to two disjoint subgraphs with sets of vertices
$\{v_1, v_2\}$ and $\{v_3, v_4\}$, respectively, each of which corresponds to a different argument of an
occurrence of the disjunction operator.

Given the notion of cut, we can associate a finite language with each IDL-graph and
talk about equivalence with IDL-expressions. Let $\pi$ be an IDL-expression over $\Sigma$, and let
$\gamma(\pi) = (V, E, v_s, v_e, \lambda, r)$ be the associated IDL-graph. Let also $c, c' \in$ cut($\pi$) and $w \in \Sigma^*$.
We write $w \in L(c, c')$ if there exists $q \ge |w|$, $X_i \in \Sigma \cup \{\varepsilon\}$, $1 \le i \le q$, and $c_i \in$ cut($\pi$),
$0 \le i \le q$, such that $X_1 \cdots X_q = w$, $c_0 = c$, $c_q = c'$ and $(c_{i-1}, X_i, c_i) \in \Delta_\pi$ for $1 \le i \le q$.

4. The pictorial representation mentioned above comes close to a different definition of cut that is standard 
in the literature on graph theory and operations research. The reader should be aware that this standard
graph-theoretic notion of "cut" is different from the one introduced in this paper. 
We also assume that $L(c, c) = \{\varepsilon\}$. We can then show that $L(v_s, v_e) = L(\pi)$, i.e., the
language generated by the IDL-expression $\pi$ is the same as the language that we obtain in
a traversal of the IDL-graph $\gamma(\pi)$, as described above, starting from cut $v_s$ and ending in
cut $v_e$. The proof of this property is rather long and does not add much to the already
provided intuition underlying the definitions in this section; therefore we will omit it.
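
To make Definitions 4 and 5 concrete, the following sketch computes the one-step relation, the set of cuts, and the language between the start and end cuts for a hand-built graph. Everything here is an assumption made for illustration: the graph encodes the small expression ||(a, x(b . c)) as Definition 3 would (up to vertex naming), vertices are integers, cuts are tuples, and the empty string stands in for the empty label.

```python
EPS, FORK, JOIN = "", "|-", "-|"   # stand-ins for the three special labels

# Hand-built IDL-graph for ||(a, x(b . c)): vertex 0 is v_s, vertex 9 is v_e.
OUT = {0: [(FORK, 1), (FORK, 3)], 1: [("a", 2)], 2: [(JOIN, 9)],
       3: [(EPS, 4)], 4: [("b", 5)], 5: [(EPS, 6)], 6: [("c", 7)],
       7: [(EPS, 8)], 8: [(JOIN, 9)], 9: []}
RANK = {v: (1 if 4 <= v <= 7 else 0) for v in OUT}   # 4..7 lie inside the lock
JOINS = {9: (2, 8)}   # sources of the JOIN edges into a vertex, in argument order

def delta(c):
    """All (label, c2) such that (c, label, c2) satisfies Definition 4."""
    top = max(RANK[v] for v in c)
    moves = set()
    for p, v in enumerate(c):
        if RANK[v] != top:
            continue                     # only maximal-rank vertices may move
        edges = OUT[v]
        if edges and all(lab == FORK for lab, _ in edges):       # condition (5)
            moves.add((EPS, c[:p] + tuple(t for _, t in edges) + c[p + 1:]))
        for lab, t in edges:                                     # condition (4)
            if lab not in (FORK, JOIN):
                moves.add((lab, c[:p] + (t,) + c[p + 1:]))
    for v, srcs in JOINS.items():                                # condition (6)
        n = len(srcs)
        for p in range(len(c) - n + 1):
            if c[p:p + n] == srcs and all(RANK[u] == top for u in srcs):
                moves.add((EPS, c[:p] + (v,) + c[p + n:]))
    return moves

def cuts(start=(0,)):
    """Smallest set containing the start cut and closed under delta."""
    seen, todo = {start}, [start]
    while todo:
        for _, c2 in delta(todo.pop()):
            if c2 not in seen:
                seen.add(c2)
                todo.append(c2)
    return seen

def language(c, end=(9,)):
    """Strings (as tuples of symbols) consumed on some traversal from c to end."""
    if c == end:
        return {()}
    result = set()
    for lab, c2 in delta(c):
        prefix = () if lab == EPS else (lab,)
        result |= {prefix + w for w in language(c2, end)}
    return result

print(len(cuts()), sorted(language((0,))))
```

The rank test is what keeps `a` from being read while the traversal is inside the locked subgraph: the cut `(1, 5)` is reachable, but it has no move labelled `a`, so the string b a c is not generated.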

We close this section with an informal discussion of relation $\Delta_\pi$ and the associated notion
of cut. Observe that Definition 4 and Definition 5 implicitly define a nondeterministic finite
automaton. Again, we refer the reader to Harrison (1978) for a definition of finite automata.
The states of the automaton are the cuts in cut($\pi$) and its transitions are given by the
elements of $\Delta_\pi$. The initial state of the automaton is the cut $v_s$, and the final state is the
cut $v_e$. It is not difficult to see that from every state of the automaton one can always reach
the final state. Furthermore, the language recognized by such an automaton is precisely the
language $L(v_s, v_e)$ defined above. However, we remark here that such an automaton will
never be constructed by our parsing algorithm, as emphasized in the next section. 

6. CFG Parsing of IDL-Graphs 

We start this section with a brief overview of the Earley algorithm (Earley, 1970), a well- 
known tabular method for parsing input strings according to a given CFG. We then refor- 
mulate the Earley algorithm in order to parse IDL-graphs. As already mentioned in the 
introduction, while parsing is traditionally defined for input consisting of a single string, we 
here conceive parsing as a process that can be carried out on an input device representing 
a language, i.e., a set of strings. 

Let $G = (N, \Sigma, P, S)$ be a CFG, and let $w = a_1 \cdots a_n \in \Sigma^*$ be an input string to be
parsed. Standard implementations of the Earley algorithm (Graham & Harrison, 1976) use
so-called parsing items to record partial results of the parsing process on $w$. A parsing item
has the form $[A \to \alpha \bullet \beta, i, j]$, where $A \to \alpha\beta$ is a production of $G$ and $i$ and $j$ are indices
identifying a substring $a_{i+1} \cdots a_j$ of $w$. Such a parsing item is constructed by the algorithm
if and only if there exist a string $\gamma \in (N \cup \Sigma)^*$ and two derivations in $G$ having the form

$$S \Rightarrow_G^* a_1 \cdots a_i A \gamma \Rightarrow_G a_1 \cdots a_i \alpha\beta\gamma, \qquad \alpha \Rightarrow_G^* a_{i+1} \cdots a_j.$$

The algorithm accepts $w$ if and only if it can construct an item of the form $[S \to \alpha\,\bullet, 0, n]$, for
some production $S \to \alpha$ of $G$. Figure 2 provides an abstract specification of the algorithm
expressed as a deduction system, following Shieber, Schabes, and Pereira (1995). Inference 
rules specify the types of steps that the algorithm can apply in constructing new items. 

Rule (7) in Figure 2 serves as an initialization step, constructing all items that can
start analyses for productions with the start symbol $S$ in the left-hand side. Rule (8)
is very similar in purpose: it constructs all items that can start analyses for productions 
with nonterminal B in the left-hand side, provided that B is the next nonterminal in some 
existing item for which an analysis is to be found. Rule (9) matches a terminal a in an item 
with an input symbol, and the new item signifies that a larger part of the right-hand side 
has been matched to a larger part of the input. Finally, Rule (10) combines two partial 
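
$$[S \to \bullet\,\alpha,\ 0,\ 0] \quad \{\, S \to \alpha \qquad (7)$$

$$\frac{[A \to \alpha \bullet B\beta,\ i,\ j]}{[B \to \bullet\,\gamma,\ j,\ j]} \quad \{\, B \to \gamma \qquad (8)$$

$$\frac{[A \to \alpha \bullet a\beta,\ i,\ j]}{[A \to \alpha a \bullet \beta,\ i,\ j+1]} \quad \{\, a = a_{j+1} \qquad (9)$$

$$\frac{[A \to \alpha \bullet B\beta,\ i,\ j] \quad [B \to \gamma\,\bullet,\ j,\ k]}{[A \to \alpha B \bullet \beta,\ i,\ k]} \qquad (10)$$

Figure 2: Abstract specification of the parsing algorithm of Earley for an input string
$a_1 \cdots a_n$. The algorithm accepts $w$ if and only if it can construct an item of
the form $[S \to \alpha\,\bullet, 0, n]$, for some production $S \to \alpha$ of $G$.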



analyses, the second of which represents an analysis for symbol B, by which the analysis 
represented by the first item can be extended. 
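
The four rules can be run as an agenda-driven closure. The sketch below is one straightforward way to do so, written for clarity rather than efficiency; the grammar encoding (a dictionary from nonterminals to lists of right-hand-side tuples) and all names are assumptions made for this example. An item is a tuple `(A, rhs, dot, i, j)`.

```python
def earley(grammar, start, words):
    """Recognize `words` with the deduction rules (7)-(10)."""
    n = len(words)
    chart = set()
    agenda = [(start, rhs, 0, 0, 0) for rhs in grammar[start]]       # rule (7)
    while agenda:
        item = agenda.pop()
        if item in chart:
            continue
        chart.add(item)
        A, rhs, dot, i, j = item
        if dot == len(rhs):
            # Rule (10), with the popped item as the completed second premise.
            agenda += [(o[0], o[1], o[2] + 1, o[3], j) for o in chart
                       if o[2] < len(o[1]) and o[1][o[2]] == A and o[4] == i]
        elif rhs[dot] in grammar:
            B = rhs[dot]
            agenda += [(B, g, 0, j, j) for g in grammar[B]]          # rule (8)
            # Rule (10), with the popped item as the first premise.
            agenda += [(A, rhs, dot + 1, i, o[4]) for o in chart
                       if o[0] == B and o[2] == len(o[1]) and o[3] == j]
        elif j < n and words[j] == rhs[dot]:
            agenda.append((A, rhs, dot + 1, i, j + 1))               # rule (9)
    return any((start, rhs, len(rhs), 0, n) in chart for rhs in grammar[start])

GRAMMAR = {"S": [("NP", "VP")], "NP": [("we",), ("piano",)],
           "VP": [("V", "NP"), ("V",)], "V": [("play",)]}
print(earley(GRAMMAR, "S", ["we", "play", "piano"]))   # True
print(earley(GRAMMAR, "S", ["play", "we"]))            # False
```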

We can now move to our algorithm for IDL-graph parsing using a CFG. The algorithm 
makes use of relation $\Delta_\pi$ from Definition 4, but this does not mean that the relation is
fully computed before invoking the algorithm. We instead compute elements of $\Delta_\pi$
"on-the-fly" when we first visit a cut, and cache these elements for possible later use. This has
the advantage that, when parsing an input IDL-graph, our algorithm processes only those 
portions of the graph that represent prefixes of strings that are generated by the CFG at 
hand. In practical cases, the input IDL-graph is never completely unfolded, so that the 
compactness of the proposed representation is preserved to a large extent. 

An alternative way of viewing our algorithm is this. We have already informally discussed
in Section 5 how relation $\Delta_\pi$ implicitly defines a nondeterministic finite automaton
whose states are the elements of cut($\pi$) and whose transitions are the elements of $\Delta_\pi$. We
have also mentioned that such an automaton precisely recognizes the finite language $L(\pi)$.
From this perspective, our algorithm can be seen as a standard lattice parsing algorithm, 
discussed in Section 2. What must be emphasized here is that we do not precompute the 
above finite automaton prior to parsing. Our approach consists in a lazy evaluation of the 
transitions of the automaton, on the basis of a demand on the part of the parsing process. 
In contrast with our approach, full expansion of the finite automaton before parsing has 
several disadvantages. Firstly, although a finite automaton generating a finite language 
might be considerably smaller than a representation of the language itself consisting of a
list of all its elements, it is easy to see that there are cases in which the finite automaton 
might have size exponentially larger than the corresponding IDL-expression (see also the 
discussion in Section 2). In such cases, full expansion destroys the compactness of IDL- 
expressions, which is the main motivation for the use of our formalism in hybrid surface 
generation systems, as discussed in the introduction. Furthermore, full expansion of the 
automaton is also computationally unattractive, since it may lead to unfolding of parts of 
the input IDL-graph that will never be processed by the parsing algorithm. 

Let $G = (N, \Sigma, P, S)$ be a CFG and let $\pi$ be some input IDL-expression. The algorithm
uses parsing items of the form $[A \to \alpha \bullet \beta, c_1, c_2]$, with $A \to \alpha\beta$ a production in $P$ and
$c_1, c_2 \in$ cut($\pi$). These items have the same meaning as those used in the original Earley
algorithm, but now they refer to strings in the languages $L(v_s, c_1)$ and $L(c_1, c_2)$, where $v_s$
is the start vertex of IDL-graph $\gamma(\pi)$. (Recall from Section 5 that $L(c, c')$, $c, c' \in$ cut($\pi$),
is the set of strings whose symbols can be consumed in any traversal of $\gamma(\pi)$ starting from
cut $c$ and ending in cut $c'$.) We also use items of the forms $[c_1, c_2]$ and $[a, c_1, c_2]$, $a \in \Sigma$,
$c_1, c_2 \in$ cut($\pi$). This is done in order to by-pass traversals of $\gamma(\pi)$ involving a sequence of zero
or more triples of the form $(c_1, \varepsilon, c_2) \in \Delta_\pi$, followed by a triple of the form $(c_1, a, c_2) \in \Delta_\pi$.
Figure 3 presents an abstract specification of the algorithm, again using a set of inference 
rules. The issues of control flow and implementation are deferred to the next section. 

In what follows, let $v_s$ and $v_e$ be the start and end vertices of IDL-graph $\gamma(\pi)$, respectively.
Rules (11), (12) and (15) in Figure 3 closely resemble Rules (7), (8) and (10) of the
original Earley algorithm, as reported in Figure 2. Rules (13), (16) and (17) have been
introduced for the purpose of efficiently computing traversals of $\gamma(\pi)$ involving a sequence of zero
or more triples of the form $(c_1, \varepsilon, c_2) \in \Delta_\pi$, followed by a triple of the form $(c_1, a, c_2) \in \Delta_\pi$,
as already mentioned. Once one such traversal has been computed, the fact is recorded
through some item of the form $[a, c_1, c_2]$, avoiding later recomputation. Rule (14) closely
resembles Rule (9) of the original Earley algorithm. Finally, by computing traversals of
$\gamma(\pi)$ involving triples of the form $(c_1, \varepsilon, c_2) \in \Delta_\pi$ only, Rule (18) may derive items of the
form $[S \to \alpha\,\bullet, v_s, v_e]$; the algorithm accepts the input IDL-graph if and only if any such
item can be derived by the inference rules.
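
The rules described above can be sketched as an agenda-driven closure that asks for transitions only when an item needs them. The code is an illustration under several assumptions, not the implementation discussed in the next section: the function `delta` is supplied by the caller and returns the set of `(label, next_state)` moves from a state, the empty string marks moves that consume no input, and the toy automaton at the bottom uses integers in place of cuts. Items are tagged tuples: `("P", A, rhs, dot, c1, c2)` for production items, `("E", c1, c2)` for items of the form [c1, c2], and `("T", a, c1, c2)` for items of the form [a, c1, c2].

```python
EPS = ""   # label of moves that consume no input

def recognize(grammar, start, delta, v_s, v_e):
    cache = {}
    def moves(c):                       # lazy, cached access to the relation
        if c not in cache:
            cache[c] = delta(c)
        return cache[c]

    chart = set()
    agenda = [("P", start, rhs, 0, v_s, v_s) for rhs in grammar[start]]  # (11)
    while agenda:
        item = agenda.pop()
        if item in chart:
            continue
        chart.add(item)
        if item[0] == "P":
            _, A, rhs, dot, c1, c2 = item
            if dot == len(rhs):
                # (15), popped item as the completed second premise
                agenda += [("P", o[1], o[2], o[3] + 1, o[4], c2) for o in chart
                           if o[0] == "P" and o[3] < len(o[2])
                           and o[2][o[3]] == A and o[5] == c1]
                if A == start:                                           # (18)
                    agenda += [("P", A, rhs, dot, c1, c3)
                               for lab, c3 in moves(c2) if lab == EPS]
            elif rhs[dot] in grammar:   # next symbol is a nonterminal B
                B = rhs[dot]
                agenda += [("P", B, g, 0, c2, c2) for g in grammar[B]]   # (12)
                agenda += [("P", A, rhs, dot + 1, c1, o[5]) for o in chart
                           if o[0] == "P" and o[1] == B
                           and o[3] == len(o[2]) and o[4] == c2]         # (15)
            else:                       # next symbol is a terminal
                agenda.append(("E", c2, c2))                             # (13)
                agenda += [("P", A, rhs, dot + 1, c1, o[3]) for o in chart
                           if o[0] == "T" and o[1] == rhs[dot]
                           and o[2] == c2]                               # (14)
        elif item[0] == "E":
            _, c1, c2 = item
            for lab, c3 in moves(c2):                              # (16), (17)
                agenda.append(("E", c1, c3) if lab == EPS
                              else ("T", lab, c1, c3))
        else:
            _, a, c2, c3 = item
            # (14), popped item as the second premise
            agenda += [("P", o[1], o[2], o[3] + 1, o[4], c3) for o in chart
                       if o[0] == "P" and o[3] < len(o[2])
                       and o[2][o[3]] == a and o[5] == c2]
    return any(("P", start, rhs, len(rhs), v_s, v_e) in chart
               for rhs in grammar[start])

# Toy automaton accepting "we must play" and "we play"; state 4 is final.
LATTICE = {0: {("we", 1)}, 1: {("must", 2), (EPS, 2)}, 2: {("play", 3)},
           3: {(EPS, 4)}, 4: set()}
GRAMMAR = {"S": [("NP", "VP")], "NP": [("we",)],
           "VP": [("AUX", "V"), ("V",)], "AUX": [("must",)], "V": [("play",)]}
print(recognize(GRAMMAR, "S", LATTICE.__getitem__, 0, 4))   # True
```

Because `moves` is only called for states that some item has actually reached, states that the grammar can never reach are never expanded, which is the point of the lazy evaluation described above.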

We now turn to the discussion of the correctness of the algorithm in Figure 3. Our
algorithm derives a parsing item $[A \to \alpha \bullet \beta, c_1, c_2]$ if and only if there exist a string
$\gamma \in (N \cup \Sigma)^*$, integers $i, j$ with $0 \le i \le j$, and $a_1 a_2 \cdots a_j \in \Sigma^*$ such that the following
conditions are all satisfied:

• $a_1 \cdots a_i \in L(v_s, c_1)$;

• $a_{i+1} \cdots a_j \in L(c_1, c_2)$; and

• there exist two derivations in $G$ of the form

$$S \Rightarrow_G^* a_1 \cdots a_i A \gamma \Rightarrow_G a_1 \cdots a_i \alpha\beta\gamma, \qquad \alpha \Rightarrow_G^* a_{i+1} \cdots a_j.$$



The above statement closely resembles the existential condition previously discussed for the 
original Earley algorithm, and can be proved using arguments similar to those presented for 
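
$$[S \to \bullet\,\alpha,\ v_s,\ v_s] \quad \{\, S \to \alpha \qquad (11)$$

$$\frac{[A \to \alpha \bullet B\beta,\ c_1,\ c_2]}{[B \to \bullet\,\gamma,\ c_2,\ c_2]} \quad \{\, B \to \gamma \qquad (12)$$

$$\frac{[A \to \alpha \bullet a\beta,\ c_1,\ c_2]}{[c_2,\ c_2]} \qquad (13)$$

$$\frac{[A \to \alpha \bullet a\beta,\ c_1,\ c_2] \quad [a,\ c_2,\ c_3]}{[A \to \alpha a \bullet \beta,\ c_1,\ c_3]} \qquad (14)$$

$$\frac{[A \to \alpha \bullet B\beta,\ c_1,\ c_2] \quad [B \to \gamma\,\bullet,\ c_2,\ c_3]}{[A \to \alpha B \bullet \beta,\ c_1,\ c_3]} \qquad (15)$$

$$\frac{[c_1,\ c_2]}{[c_1,\ c_3]} \quad \{\, (c_2, \varepsilon, c_3) \in \Delta_\pi \qquad (16)$$

$$\frac{[c_1,\ c_2]}{[a,\ c_1,\ c_3]} \quad \{\, (c_2, a, c_3) \in \Delta_\pi,\ a \in \Sigma \qquad (17)$$

$$\frac{[S \to \alpha\,\bullet,\ c_0,\ c_1]}{[S \to \alpha\,\bullet,\ c_0,\ c_2]} \quad \{\, (c_1, \varepsilon, c_2) \in \Delta_\pi \qquad (18)$$

Figure 3: An abstract specification of the parsing algorithm for IDL-graphs. The algorithm
accepts the IDL-graph $\gamma(\pi)$ if and only if some item having the form
$[S \to \alpha\,\bullet, v_s, v_e]$ can be derived by the inference rules, where $S \to \alpha$ is a production of $G$
and $v_s$ and $v_e$ are the start and end vertices of $\gamma(\pi)$, respectively.
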
instance by Aho and Ullman (1972) and by Graham and Harrison (1976); we will therefore
omit a complete proof here. Note that the correctness of the algorithm in Figure 3 directly
follows from the above statement, by taking item $[A \to \alpha \bullet \beta, c_1, c_2]$ to be of the form
$[S \to \alpha\,\bullet, v_s, v_e]$ for some production $S \to \alpha$ from $G$.

7. Complexity and Implementation 

In this section we provide a computational analysis of our parsing algorithm for IDL-graphs. 
The analysis is based on the development of a tight upper bound on the number of possible 
cuts admitted by an IDL-graph. We also discuss two possible implementations for the 
parsing algorithm. 

We need to introduce some notation. Let $\pi$ be an IDL-expression and let $\gamma(\pi) =
(V, E, v_s, v_e, \lambda, r)$ be the associated IDL-graph. A vertex $v \in V$ is called L-free in $\gamma(\pi)$ if,
for every subexpression $\pi'$ of $\pi$ such that $\gamma_j(\pi') = (V', E', v'_s, v'_e, \lambda', r')$ for some $j$, $V' \subseteq V$,
$E' \subseteq E$, and such that $v \in V'$, we have that $\pi'$ is not of the form $\times(\pi'')$. In words, a vertex
is L-free in $\gamma(\pi)$ if it does not belong to a subgraph of $\gamma(\pi)$ that encodes an IDL-expression
headed by an L operator. When $\gamma(\pi)$ is understood from the context, we write L-free in
place of L-free in $\gamma(\pi)$. We write 0-cut($\pi$) to denote the set of all cuts in cut($\pi$) that only
contain vertices that are L-free in $\gamma(\pi)$. We now introduce two functions that will be used
later in the complexity analysis of our algorithm. For a cut $c \in$ cut($\pi$) we write $|c|$ to denote
the length of $c$, i.e., the number of vertices in the cut.

Definition 6 Let $\pi$ be an IDL-expression. Functions width and 0-width are specified as
follows:

$$\mathrm{width}(\pi) = \max_{c \in \mathrm{cut}(\pi)} |c|,$$

$$0\text{-}\mathrm{width}(\pi) = \max_{c \in 0\text{-}\mathrm{cut}(\pi)} |c|.$$

Function width provides the maximum length of a cut through $\gamma(\pi)$. This quantity gives
the maximum number of subexpressions of $\pi$ that need to be evaluated in parallel when
generating a string in $L(\pi)$. Similarly, function 0-width provides the maximum length of a
cut through $\gamma(\pi)$ that only includes L-free nodes.

Despite the fact that cut($\pi$) is always a finite set, a computation of functions width and
0-width through a direct computation of cut($\pi$) and 0-cut($\pi$) is not practical, since these
sets may have exponential size in the number of vertices of $\gamma(\pi)$. The next characterization
provides a more efficient way to compute the above functions, and will be used in the proof 
of Lemma 2 below. 

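Lemma 1 Let $\pi$ be an IDL-expression. The quantities width($\pi$) and 0-width($\pi$) satisfy the
following equations:

(i) if $\pi = a$, $a \in \Sigma \cup \{\varepsilon\}$, we have

$$\mathrm{width}(\pi) = 1, \qquad 0\text{-}\mathrm{width}(\pi) = 1;$$

(ii) if $\pi = \times(\pi')$ we have

$$\mathrm{width}(\pi) = \mathrm{width}(\pi'), \qquad 0\text{-}\mathrm{width}(\pi) = 1;$$

(iii) if $\pi = \vee(\pi_1, \pi_2, \ldots, \pi_n)$ we have

$$\mathrm{width}(\pi) = \max_{i=1}^{n} \mathrm{width}(\pi_i), \qquad 0\text{-}\mathrm{width}(\pi) = \max_{i=1}^{n} 0\text{-}\mathrm{width}(\pi_i);$$

(iv) if $\pi = \|(\pi_1, \pi_2, \ldots, \pi_n)$ we have

$$\mathrm{width}(\pi) = \max_{j=1}^{n} \Bigl(\mathrm{width}(\pi_j) + \sum_{i:\, 1 \le i \le n \,\wedge\, i \ne j} 0\text{-}\mathrm{width}(\pi_i)\Bigr), \qquad 0\text{-}\mathrm{width}(\pi) = \sum_{j=1}^{n} 0\text{-}\mathrm{width}(\pi_j);$$

(v) if $\pi = \pi_1 \cdot \pi_2$ we have

$$\mathrm{width}(\pi) = \max\{\mathrm{width}(\pi_1), \mathrm{width}(\pi_2)\}, \qquad 0\text{-}\mathrm{width}(\pi) = \max\{0\text{-}\mathrm{width}(\pi_1), 0\text{-}\mathrm{width}(\pi_2)\}.$$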
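
The equations of Lemma 1 translate directly into a recursion over the expression, which avoids enumerating cuts. The sketch below assumes a nested-tuple encoding of IDL-expressions with the hypothetical tags `"atom"`, `"lock"`, `"or"`, `"ileave"` and `"cat"`; it returns the pair (width, 0-width).

```python
def widths(expr):
    """Return (width, 0-width) of a nested-tuple IDL-expression."""
    kind, args = expr[0], expr[1:]
    if kind == "atom":                        # case (i)
        return 1, 1
    if kind == "lock":                        # case (ii)
        return widths(args[0])[0], 1
    subs = [widths(a) for a in args]
    if kind in ("or", "cat"):                 # cases (iii) and (v)
        return max(w for w, _ in subs), max(z for _, z in subs)
    if kind == "ileave":                      # case (iv)
        total0 = sum(z for _, z in subs)
        return max(w + total0 - z for w, z in subs), total0
    raise ValueError(f"unknown operator: {kind}")

def A(a):
    return ("atom", a)

# The expression of Figure 1.
expr2 = ("ileave", ("or", A("necessarily"), A("must")),
         ("cat", A("we"), ("lock", ("cat", A("play"), A("piano")))))

# The subordinate-clause example of Section 4, with binary concatenation.
inner = ("ileave", A("PP3"), A("PP4"),
         ("cat", A("NP3"), ("lock", ("cat", A("V2"), A("NP4")))))
embedded = ("ileave", A("PP1"), A("PP2"),
            ("cat", A("NP1"),
             ("cat", ("lock", ("cat", A("V1"), A("NP2"))), ("lock", inner))))

print(widths(expr2), widths(embedded))   # (2, 2) (5, 3)
```

Each subexpression is visited once, so the computation is linear in the size of the expression.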

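Proof. All of the equations in the statement of the lemma straightforwardly follow from
the definitions of $\Delta_\pi$ and cut($\pi$) (Definitions 4 and 5, respectively). Here we develop at
length only two cases and leave the remainder of the proof to the reader. In what follows
we assume that $\gamma(\pi) = (V, E, v_s, v_e, \lambda, r)$.

In case $\pi = \vee(\pi_1, \pi_2, \ldots, \pi_n)$, let $\gamma(\pi_i) = (V_i, E_i, v_{i,s}, v_{i,e}, \lambda_i, r_i)$, $1 \le i \le n$. From
Definition 4 we have $(v_s, \varepsilon, v_{i,s}) \in \Delta_\pi$ and $(v_{i,e}, \varepsilon, v_e) \in \Delta_\pi$, for every $i$, $1 \le i \le n$. Thus
we have $\mathrm{cut}(\pi) = \bigcup_{i=1}^{n} \mathrm{cut}(\pi_i) \cup \{v_s, v_e\}$ and, since both $v_s$ and $v_e$ are L-free in $\gamma(\pi)$,
$0\text{-}\mathrm{cut}(\pi) = \bigcup_{i=1}^{n} 0\text{-}\mathrm{cut}(\pi_i) \cup \{v_s, v_e\}$. This provides the relations in (iii).

In case $\pi = \|(\pi_1, \pi_2, \ldots, \pi_n)$, let $\gamma(\pi_i) = (V_i, E_i, v_{i,s}, v_{i,e}, \lambda_i, r_i)$, $1 \le i \le n$. From
Definition 4 we have $(v_s, \varepsilon, v_{1,s} \cdots v_{n,s}) \in \Delta_\pi$ and $(v_{1,e} \cdots v_{n,e}, \varepsilon, v_e) \in \Delta_\pi$. Thus every
$c \in$ cut($\pi$) must belong to $\{v_s, v_e\}$ or must have the form $c = c_1 \cdots c_n$ with $c_i \in$ cut($\pi_i$) for
$1 \le i \le n$. Since both $v_s$ and $v_e$ are L-free in $\gamma(\pi)$, we immediately derive

$$0\text{-}\mathrm{cut}(\pi) = \{v_s, v_e\} \cup 0\text{-}\mathrm{cut}(\pi_1) \cdots 0\text{-}\mathrm{cut}(\pi_n),$$

and hence $0\text{-}\mathrm{width}(\pi) = \sum_{j=1}^{n} 0\text{-}\mathrm{width}(\pi_j)$. Now observe that, for each $c = c_1 \cdots c_n$ specified
as above there can never be indices $i$ and $j$, $1 \le i, j \le n$ and $i \ne j$, and vertices $v_1$ and $v_2$
occurring in $c_i$ and $c_j$, respectively, such that neither $v_1$ nor $v_2$ is L-free in $\gamma(\pi)$.
We thereby derive

$$\begin{aligned}
\mathrm{cut}(\pi) = \{v_s, v_e\} \;\cup\; & \mathrm{cut}(\pi_1)\, 0\text{-}\mathrm{cut}(\pi_2) \cdots 0\text{-}\mathrm{cut}(\pi_n) \;\cup \\
& 0\text{-}\mathrm{cut}(\pi_1)\, \mathrm{cut}(\pi_2) \cdots 0\text{-}\mathrm{cut}(\pi_n) \;\cup \\
& \;\vdots \\
& 0\text{-}\mathrm{cut}(\pi_1)\, 0\text{-}\mathrm{cut}(\pi_2) \cdots \mathrm{cut}(\pi_n).
\end{aligned}$$

Hence we can write $\mathrm{width}(\pi) = \max_{j=1}^{n} \bigl(\mathrm{width}(\pi_j) + \sum_{i:\, 1 \le i \le n \,\wedge\, i \ne j} 0\text{-}\mathrm{width}(\pi_i)\bigr)$. $\Box$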

Now consider quantity $|\mathrm{cut}(\pi)|$, i.e., the number of different cuts in IDL-graph $\gamma(\pi)$.
This quantity is obviously bounded from above by $|V|^{\mathrm{width}(\pi)}$. We now derive a tighter
upper bound on this quantity.

Lemma 2 Let $\Sigma$ be a finite alphabet, let $\pi$ be an IDL-expression over $\Sigma$, and let $\gamma(\pi) =
(V, E, v_s, v_e, \lambda, r)$ be its associated IDL-graph. Let also $k = \mathrm{width}(\pi)$. We have

$$|\mathrm{cut}(\pi)| \le \left(\frac{|V|}{k}\right)^{k}.$$

Proof. We use below the following inequality. For any integer $h \ge 2$ and real values $x_i > 0$,
$1 \le i \le h$, we have

$$\prod_{i=1}^{h} x_i \le \left(\frac{\sum_{i=1}^{h} x_i}{h}\right)^{h}. \qquad (19)$$

In words, (19) states that the geometric mean is never larger than the arithmetic mean.

We prove (19) in the following equivalent form. For any real values $c > 0$ and $y_i$,
$1 \le i \le h$ and $h \ge 2$, with $y_i > -c$ and $\sum_{i=1}^{h} y_i = 0$, we have

$$\prod_{i=1}^{h} (c + y_i) \le c^{h}. \qquad (20)$$

We start by observing that if the $y_i$ are all equal to zero, then we are done. Otherwise there
must be $i$ and $j$ with $1 \le i, j \le h$ such that $y_i y_j < 0$. Without loss of generality, we assume
$i = 1$ and $j = 2$. Since $y_1 y_2 < 0$, we have

$$(c + y_1)(c + y_2) = c(c + y_1 + y_2) + y_1 y_2 < c(c + y_1 + y_2). \qquad (21)$$

Since $\prod_{i=3}^{h} (c + y_i) > 0$, we have

$$(c + y_1)(c + y_2) \prod_{i=3}^{h} (c + y_i) < c\,(c + y_1 + y_2) \prod_{i=3}^{h} (c + y_i). \qquad (22)$$

We now observe that the right-hand side of (22) has the same form as the left-hand side
of (20), but with fewer $y_i$ that are non-zero. We can therefore iterate the above procedure,
until all $y_i$ become zero valued. This concludes the proof of (19).

Let us turn to the proof of the statement of the lemma. Recall that each cut $c \in$ cut($\pi$)
is a string over $V$ such that no vertex in $V$ has more than one occurrence in $c$, and $c$ is
canonically represented, i.e., no other permutation of the vertices in $c$ is a possible cut. We
will later prove the following claim.

Claim. Let $\pi$, $V$ and $k$ be as in the statement of the lemma. We can partition $V$ into
subsets $V[\pi, j]$, $1 \le j \le k$, having the following property. For every $V[\pi, j]$, $1 \le j \le k$, and
every pair of distinct vertices $v_1, v_2 \in V[\pi, j]$, $v_1$ and $v_2$ do not occur together in any cut
$c \in$ cut($\pi$).

We can then write

$$\begin{aligned}
|\mathrm{cut}(\pi)| &\le \prod_{j=1}^{k} |V[\pi, j]| && \text{(by our claim and the canonical representation of cuts)} \\
&\le \left(\frac{\sum_{j=1}^{k} |V[\pi, j]|}{k}\right)^{k} && \text{(by (19))} \\
&= \left(\frac{|V|}{k}\right)^{k}.
\end{aligned}$$

To complete the proof of the lemma we now need to prove our claim above. We prove
the following statement, which is a slightly stronger version of the claim. We can partition
set $V$ into subsets $V[\pi, j]$, $1 \le j \le k = \mathrm{width}(\pi)$, having the following two properties:

• for every $V[\pi, j]$, $1 \le j \le k$, and every pair of distinct vertices $v_1, v_2 \in V[\pi, j]$, $v_1$ and
$v_2$ do not occur together in any cut $c \in$ cut($\pi$);

• all vertices in $V$ that are L-free in $\gamma(\pi)$ are included in some $V[\pi, j]$, $1 \le j \le$
0-width($\pi$). (In other words, the sets $V[\pi, j]$, 0-width($\pi$) $< j \le$ width($\pi$), can only
contain vertices that are not L-free in $\gamma(\pi)$.)

In what follows we use induction on $\#_{op}(\pi)$, the number of operator occurrences (I, D, L
and concatenation) appearing within $\pi$.

Base: $\#_{op}(\pi) = 0$. We have $\pi = a$, with $a \in \Sigma \cup \{\varepsilon\}$, and $V = \{v_s, v_e\}$. Since width($\pi$) $= 1$,
we set $V[\pi, 1] = V$. This satisfies our claim, since cut($\pi$) $= \{v_s, v_e\}$, all vertices in $V$ are
L-free in $\gamma(\pi)$ and we have 0-width($\pi$) $= 1$.

Induction: $\#_{op}(\pi) > 0$. We distinguish among three possible cases.

Case 1: $\pi = \vee(\pi_1, \pi_2, \ldots, \pi_n)$. Let $\gamma(\pi_i) = (V_i, E_i, v_{i,s}, v_{i,e}, \lambda_i, r_i)$, $1 \le i \le n$. By Lemma 1
we have $\mathrm{width}(\pi) = \max_{i=1}^{n} \mathrm{width}(\pi_i)$. For each $i$, $1 \le i \le n$, let us define $V[\pi_i, j] = \emptyset$ for
every $j$ such that width($\pi_i$) $< j \le$ width($\pi$). We can then set

$$V[\pi, 1] = \Bigl(\bigcup_{i=1}^{n} V[\pi_i, 1]\Bigr) \cup \{v_s, v_e\};$$

$$V[\pi, j] = \bigcup_{i=1}^{n} V[\pi_i, j], \quad \text{for } 2 \le j \le \mathrm{width}(\pi).$$

The sets $V[\pi, j]$ define a partition of $V$, since $V = (\bigcup_{i=1}^{n} V_i) \cup \{v_s, v_e\}$ and, for each $i$, the
sets $V[\pi_i, j]$ define a partition of $V_i$ by the inductive hypothesis. We now show that such a
partition satisfies the two conditions in our statement.

Let $v_1$ and $v_2$ be two distinct vertices in some $V[\pi, j]$. We have already established in
the proof of Lemma 1 that $\mathrm{cut}(\pi) = (\bigcup_{i=1}^{n} \mathrm{cut}(\pi_i)) \cup \{v_s, v_e\}$. If either $v_1$ or $v_2$ belongs
to the set $\{v_s, v_e\}$, then $v_1$ and $v_2$ cannot occur in the same cut in $\mathrm{cut}(\pi)$, since the only
cuts in $\mathrm{cut}(\pi)$ with vertices in the set $\{v_s, v_e\}$ are $v_s$ and $v_e$. Let us now consider the case
$v_1, v_2 \in \bigcup_{i=1}^{n} V_i$. We can distinguish two subcases. In the first subcase, there exists $i$ such
that $v_1, v_2 \in V[\pi_i, j]$. The inductive hypothesis states that $v_1$ and $v_2$ cannot occur in the
same cut in $\mathrm{cut}(\pi_i)$, and hence cannot occur in the same cut in $\mathrm{cut}(\pi)$. In the second
subcase, $v_1 \in V[\pi_i, j]$ and $v_2 \in V[\pi_{i'}, j]$ for distinct $i$ and $i'$. Then $v_1$ and $v_2$ must belong
to different graphs $\gamma(\pi_i)$ and $\gamma(\pi_{i'})$, and hence cannot occur in the same cut in $\mathrm{cut}(\pi)$.

Furthermore, every vertex in $\bigcup_{i=1}^{n} V_i$ that is L-free in some $\gamma(\pi_i)$ belongs to some
$V[\pi_i, j]$ with $1 \le j \le 0\text{-}\mathrm{width}(\pi_i)$, by the inductive hypothesis. Since $0\text{-}\mathrm{width}(\pi) =
\max_{i=1}^{n} 0\text{-}\mathrm{width}(\pi_i)$ (Lemma 1) we can state that all vertices in $V$ that are L-free in $\gamma(\pi)$
belong to some $V[\pi, j]$, $1 \le j \le 0\text{-}\mathrm{width}(\pi)$.

Case 2: $\pi = \times(\pi')$ or $\pi = \pi_1 \cdot \pi_2$. The proof is almost identical to that of Case 1, with
$n = 1$ or $n = 2$, respectively.

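Case 3: $\pi = \|(\pi_1, \pi_2, \ldots, \pi_n)$. Let $\gamma(\pi_i) = (V_i, E_i, v_{i,s}, v_{i,e}, \lambda_i, r_i)$, $1 \le i \le n$. By Lemma 1
we have

$$
\begin{aligned}
0\text{-}\mathrm{width}(\pi) &= \sum_{j=1}^{n} 0\text{-}\mathrm{width}(\pi_j), \\
\mathrm{width}(\pi) &= \max_{j=1}^{n} \Big( \mathrm{width}(\pi_j) + \sum_{i \,:\, 1 \le i \le n \,\wedge\, i \ne j} 0\text{-}\mathrm{width}(\pi_i) \Big).
\end{aligned}
$$

The latter equation can be rewritten as

$$
\mathrm{width}(\pi) = \sum_{j=1}^{n} 0\text{-}\mathrm{width}(\pi_j) + \max_{j=1}^{n} \big( \mathrm{width}(\pi_j) - 0\text{-}\mathrm{width}(\pi_j) \big). \tag{23}
$$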

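For each $i$ with $1 \le i \le n$, let us define $V[\pi_i, j] = \emptyset$ for every $j$ with $\mathrm{width}(\pi_i) < j \le \mathrm{width}(\pi)$.
We can then set

$$
\begin{aligned}
V[\pi, 1] &= V[\pi_1, 1] \cup \{v_s, v_e\}; \\
V[\pi, j] &= V[\pi_1, j], && \text{for } 2 \le j \le 0\text{-}\mathrm{width}(\pi_1); \\
V[\pi, 0\text{-}\mathrm{width}(\pi_1) + j] &= V[\pi_2, j], && \text{for } 1 \le j \le 0\text{-}\mathrm{width}(\pi_2); \\
&\;\;\vdots \\
V[\pi, \textstyle\sum_{i=1}^{n-1} 0\text{-}\mathrm{width}(\pi_i) + j] &= V[\pi_n, j], && \text{for } 1 \le j \le 0\text{-}\mathrm{width}(\pi_n); \\
V[\pi, \textstyle\sum_{i=1}^{n} 0\text{-}\mathrm{width}(\pi_i) + j] &= \bigcup_{i=1}^{n} V[\pi_i, 0\text{-}\mathrm{width}(\pi_i) + j], && \text{for } 1 \le j \le \max_{j=1}^{n} \big( \mathrm{width}(\pi_j) - 0\text{-}\mathrm{width}(\pi_j) \big).
\end{aligned}
$$

The sets $V[\pi, j]$ define a partition of $V$, since $V = (\bigcup_{i=1}^{n} V_i) \cup \{v_s, v_e\}$ and, for each $i$, the
sets $V[\pi_i, j]$ define a partition of $V_i$ by the inductive hypothesis. We now show that such a
partition satisfies both conditions in our statement.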

Let $v_1$ and $v_2$ be distinct vertices in some $V[\pi, j]$, $1 \le j \le \mathrm{width}(\pi)$. We have already established
in the proof of Lemma 1 that a cut $c$ in $\mathrm{cut}(\pi)$ either belongs to $\{v_s, v_e\}$ or else must have
the form $c = c_1 \cdots c_n$ with $c_i \in \mathrm{cut}(\pi_i)$ for $1 \le i \le n$. As in Case 1, if either $v_1$ or $v_2$ belongs
to the set $\{v_s, v_e\}$, then $v_1$ and $v_2$ cannot occur in the same cut in $\mathrm{cut}(\pi)$, since the only
cuts in $\mathrm{cut}(\pi)$ with vertices in the set $\{v_s, v_e\}$ are $v_s$ and $v_e$. Consider now the case in which
$v_1, v_2 \in \bigcup_{i=1}^{n} V_i$. We distinguish two subcases.

In the first subcase, there exists $i$ such that $v_1, v_2 \in V[\pi_i, j]$. If there exists a cut
$c \in \mathrm{cut}(\pi)$ such that $v_1$ and $v_2$ both occur within $c$, then $v_1$ and $v_2$ must both occur within
some $c' \in \mathrm{cut}(\pi_i)$. But this contradicts the inductive hypothesis on $\pi_i$.

In the second subcase, $v_1 \in V[\pi_{i'}, j']$ and $v_2 \in V[\pi_{i''}, j'']$, for distinct $i'$ and $i''$. Note
that this can only happen if $0\text{-}\mathrm{width}(\pi) < j \le \mathrm{width}(\pi)$, $0\text{-}\mathrm{width}(\pi_{i'}) < j' \le \mathrm{width}(\pi_{i'})$ and
$0\text{-}\mathrm{width}(\pi_{i''}) < j'' \le \mathrm{width}(\pi_{i''})$, by our definition of the partition of $V$ and by (23). By the
inductive hypothesis on $\pi_{i'}$ and $\pi_{i''}$, $v_1$ is not L-free in $\gamma(\pi_{i'})$ and $v_2$ is not L-free in $\gamma(\pi_{i''})$,
which means that both $v_1$ and $v_2$ occur within the scope of some occurrence of the lock
operator. Note however that $v_1$ and $v_2$ cannot occur within the scope of the same occurrence
of the lock operator, since they belong to different subgraphs $\gamma(\pi_{i'})$ and $\gamma(\pi_{i''})$. Assume
now that there exists a cut $c \in \mathrm{cut}(\pi)$ such that $v_1$ and $v_2$ both occur within $c$. This would
be inconsistent with the definitions of $\Delta_\pi$ and cut (Definitions 4 and 5, respectively) since
two vertices that are not L-free and that are not within the scope of the same occurrence
of the lock operator cannot belong to the same cut.

Finally, it directly follows from the definition of our partition on $V$ and from the
inductive hypothesis on the $\pi_i$ that all vertices in $V$ that are L-free in $\gamma(\pi)$ belong to some
$V[\pi, j]$ with $1 \le j \le 0\text{-}\mathrm{width}(\pi)$. This concludes the proof of our statement. □
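
The width recurrences used in the proof can be tried out on small expressions. The following Python fragment is only a sketch: the classes `Atom` and `Op` and the single-letter operator codes are hypothetical names introduced here. The clauses for disjunction and interleave follow the equations quoted from Lemma 1 above; the clauses for concatenation and lock are assumptions (maxima for concatenation, and a 0-width of 1 for a locked subexpression, whose inner vertices are not L-free), since Lemma 1 itself is not restated in this section.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Atom:
    symbol: str


@dataclass(frozen=True)
class Op:
    kind: str                  # "D" disjunction, "C" concatenation, "L" lock, "I" interleave
    args: Tuple["Expr", ...]


Expr = Union[Atom, Op]


def widths(pi: Expr) -> Tuple[int, int]:
    """Return the pair (width, 0-width) of an IDL-expression."""
    if isinstance(pi, Atom):
        return 1, 1                                   # base case of the proof
    sub = [widths(arg) for arg in pi.args]
    if pi.kind in ("D", "C"):
        # D: maxima, as quoted from Lemma 1. C: ASSUMED to behave the same way.
        return max(w for w, _ in sub), max(z for _, z in sub)
    if pi.kind == "L":
        # ASSUMPTION: a lock keeps the width of its argument and has 0-width 1.
        return sub[0][0], 1
    if pi.kind == "I":
        zero = sum(z for _, z in sub)                 # 0-width: sum over the arguments
        return max(w + zero - z for w, z in sub), zero
    raise ValueError(f"unknown operator {pi.kind!r}")


def width_by_eq23(pi: Op) -> int:
    """Width of an interleave node computed through the rewritten form (23)."""
    sub = [widths(arg) for arg in pi.args]
    return sum(z for _, z in sub) + max(w - z for w, z in sub)


locked = Op("L", (Op("I", (Atom("a"), Atom("b"), Atom("c"))),))
example = Op("I", (locked, Atom("d")))
print(widths(example), width_by_eq23(example))        # prints (4, 2) 4
```

Under these assumptions the example has two arguments but width 4, which illustrates why the index $j$ of the sets $V[\pi, j]$ ranges up to the width of the expression rather than up to the number of arguments.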

The upper bound reported in Lemma 2 is tight. As an example, for any $i \ge 1$ and $k \ge 2$,
let $\Sigma_{i,k} = \{a_1, \ldots, a_{i \cdot k}\}$. Consider now the class of IDL-expressions

$$
\pi_{i,k} = \|(a_1 a_2 \cdots a_i,\; a_{i+1} a_{i+2} \cdots a_{2i},\; \ldots,\; a_{i \cdot (k-1)+1}\, a_{i \cdot (k-1)+2} \cdots a_{i \cdot k}).
$$

Let also $V_{i,k}$ be the vertex set of the IDL-graph $\gamma(\pi_{i,k})$. It is not difficult to see that
$|V_{i,k}| = 2 \cdot i \cdot k + 2$, $\mathrm{width}(\pi_{i,k}) = k$ and

$$
|\mathrm{cut}(\pi_{i,k})| = (2 \cdot i)^{k} + 2 \le \Big( 2 \cdot i + \frac{2}{k} \Big)^{k},
$$

where the inequality results from our upper bound. The coarser upper bound presented
before Lemma 2 would give instead $|\mathrm{cut}(\pi_{i,k})| < (2 \cdot i \cdot k + 2)^{k}$.
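
The counts in this example can be reproduced by brute force. The sketch below assumes, in line with the figures just stated, that each of the $k$ arguments contributes $2 \cdot i$ vertices and that the cuts are the two singleton cuts for $v_s$ and $v_e$ plus every choice of one vertex per argument. The vertex names and the helper functions are hypothetical.

```python
from itertools import product


def tight_example(i, k):
    """Vertices, cuts and partition classes for the flat interleave pi_{i,k}."""
    # Hypothetical vertex names: (j, p) is the p-th vertex of the j-th argument.
    components = [[(j, p) for p in range(2 * i)] for j in range(k)]
    vertices = {"vs", "ve"}.union(*components)
    # ASSUMED cut set: {vs}, {ve}, and one vertex taken from every argument.
    cuts = [("vs",), ("ve",)] + list(product(*components))
    # One partition class per argument; vs and ve join the first class.
    classes = [set(component) for component in components]
    classes[0] |= {"vs", "ve"}
    return vertices, cuts, classes


def no_class_cooccurs(cuts, classes):
    """Check the claim: no cut holds two vertices of the same class."""
    return all(sum(v in cls for v in cut) <= 1 for cut in cuts for cls in classes)


vertices, cuts, classes = tight_example(i=2, k=3)
lemma2_bound = (len(vertices) / 3) ** 3
coarse_bound = len(vertices) ** 3
print(len(vertices), len(cuts), round(lemma2_bound, 2), coarse_bound)
# prints 14 66 101.63 2744
```

For $i = 2$ and $k = 3$ the 66 cuts lie well below the coarse bound of 2744 and within a factor of two of the bound of Lemma 2.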

We can now turn to the discussion of the worst case running time for the algorithm in 
Figure 3. To simplify the presentation, let us ignore for the moment any term that solely 
depends on the input grammar G. 

To store and retrieve items $[A \to \alpha \bullet \beta, c_1, c_2]$, $[a, c_1, c_2]$ and $[c_1, c_2]$ we exploit some
data structure $T$ and access it using cut $c_1$ and cut $c_2$ as indices. In what follows we make
the assumption that each access operation on $T$ can be carried out in an amount of time
$\mathcal{O}(d(k))$, where $k = \mathrm{width}(\pi)$ and $d$ is some function that depends on the implementation
of the data structure itself, to be discussed later. After we access $T$ with some pair $c_1, c_2$,
an array is returned of length proportional to $|G|$. Thus, from such an array we can inquire
in constant time whether a given item has already been constructed.

The worst case time complexity is dominated by the rules in Figure 3 that involve the
maximum number of cuts, namely rules like (15) with three cuts each. The maximum
number of different calls to these rules is then proportional to $|\mathrm{cut}(\pi)|^{3}$. Considering our
assumptions on $T$, the total amount of time that is charged to the execution of all these
rules is then $\mathcal{O}(d(k)\, |\mathrm{cut}(\pi)|^{3})$. As in the case of the standard Earley algorithm, when the
working grammar $G$ is taken into account we must include a factor of $|G|^{2}$, which can be
reduced to $|G|$ using techniques discussed by Graham, Harrison, and Ruzzo (1980).

We also need to consider the amount of time required by the construction of relation
$\Delta_\pi$, which happens on-the-fly, as already discussed. This takes place at Rules (16), (17) and
(18). Recall that elements of relation $\Delta_\pi$ have the form $(c_1, X, c_2)$ with $c_1, c_2 \in \mathrm{cut}(\pi)$ and
$X \in \Sigma \cup \{\varepsilon\}$. In what follows, we view $\Delta_\pi$ as a directed graph whose vertices are cuts, and
thus refer to elements of such a relation as (labelled) arcs. When an arc in $\Delta_\pi$ emanating
from a cut $c_1$ with label $X$ is visited for the first time, then we compute this arc and the
reached cut, and cache them for possible later use. However, in case the reached cut $c_2$
already exists because we had previously visited an arc $(c_1', X', c_2)$, then we only cache the
new arc. For each arc in $\Delta_\pi$, all the above can be easily carried out in time $\mathcal{O}(k)$, where
$k = \mathrm{width}(\pi)$. Then the total time required by the on-the-fly construction of relation $\Delta_\pi$ is
$\mathcal{O}(k\, |\Delta_\pi|)$. For later use, we now express this bound in terms of quantity $|\mathrm{cut}(\pi)|$. From the
definition of $\Delta_\pi$ we can easily see that there can be no more than one arc between any two
cuts, and therefore $|\Delta_\pi| \le |\mathrm{cut}(\pi)|^{2}$. We obviously have $k \le |V|$. Also, it is not difficult to
prove that $|V| \le |\mathrm{cut}(\pi)|$, using induction on the number of operator occurrences appearing
within $\pi$. We thus conclude that, in the worst case, the total time required by the on-the-fly
construction of relation $\Delta_\pi$ is $\mathcal{O}(|\mathrm{cut}(\pi)|^{3})$.
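
The caching scheme described above can be pictured as a memoised successor function. The sketch below is illustrative only: `LazyDelta` and the callback `step`, which stands for whatever procedure computes the arcs leaving a cut in the IDL-graph, are hypothetical names, and cuts are assumed to be represented as tuples of vertex names.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple

Cut = Tuple[Hashable, ...]


class LazyDelta:
    """On-the-fly view of the arc relation, computing each arc at most once."""

    def __init__(self, step: Callable[[Cut, str], Iterable[Cut]]):
        self._step = step                                    # hypothetical arc computation
        self._arcs: Dict[Tuple[Cut, str], Tuple[Cut, ...]] = {}
        self._cuts: Dict[Cut, Cut] = {}                      # one stored copy per reached cut

    def successors(self, cut: Cut, label: str) -> Tuple[Cut, ...]:
        key = (cut, label)
        if key not in self._arcs:                            # first visit: compute and cache
            reached = self._step(cut, label)
            # A cut that was already reached through another arc is reused, not duplicated.
            self._arcs[key] = tuple(self._cuts.setdefault(c, c) for c in reached)
        return self._arcs[key]


calls = []

def toy_step(cut, label):
    """Toy stand-in for the real arc computation, recording how often it runs."""
    calls.append((cut, label))
    return [cut + (label,)] if label == "a" else []

delta = LazyDelta(toy_step)
first = delta.successors(("v0",), "a")
second = delta.successors(("v0",), "a")
print(first, len(calls))                                     # the second lookup hits the cache
```

Each call hashes a cut of at most $k$ vertices, which is in line with the $\mathcal{O}(k)$ cost per arc assumed in the text.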

From all of the above observations we can conclude that, in the worst case, the algorithm
in Figure 3 takes an amount of time $\mathcal{O}(|G|\, d(k)\, |\mathrm{cut}(\pi)|^{3})$. Using Lemma 2, we can then
state the following theorem.

Theorem 1 Given a context-free grammar $G$ and an IDL-graph $\gamma(\pi)$ with vertex set $V$
and with $k = \mathrm{width}(\pi)$, the algorithm in Figure 3 runs in time $\mathcal{O}(|G|\, d(k)\, (\frac{|V|}{k})^{3k})$.

We now more closely consider the choice of the data structure $T$ and the issue of its
implementation. We discuss two possible solutions. Our first solution can be used when
$|\mathrm{cut}(\pi)|$ is small enough so that we can store $|\mathrm{cut}(\pi)|^{2}$ pointers in the computer's
random-access memory. In this case we can implement $T$ as a square array of pointers to sets of
our parsing items. Each cut in $\mathrm{cut}(\pi)$ is then uniquely encoded by a non-negative integer,
and such integers are used to access the array. This solution in practice comes down to
the standard implementation of the Earley algorithm through a parse table, as presented
by Graham et al. (1980). We then have $d(k) = \mathcal{O}(1)$ and our algorithm has time complexity
$\mathcal{O}(|G|\, (\frac{|V|}{k})^{3k})$.
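
A minimal sketch of this first solution follows. The class `ArrayTable`, its field `code` and the flat list of booleans standing for the item array are all hypothetical; the sketch assumes that every cut is encoded once, when it is first created, and that the parser afterwards works with integer codes only.

```python
class ArrayTable:
    """Square table indexed by integer codes of cuts (illustrative names)."""

    def __init__(self, cuts, num_items):
        # Encode every cut once by a non-negative integer.
        self.code = {cut: n for n, cut in enumerate(cuts)}
        self._num_items = num_items            # stands for an array of length proportional to |G|
        size = len(self.code)
        self._cells = [[None] * size for _ in range(size)]

    def cell(self, code1, code2):
        """Return the item array for a pair of cut codes, creating it on demand."""
        if self._cells[code1][code2] is None:
            self._cells[code1][code2] = [False] * self._num_items
        return self._cells[code1][code2]


table = ArrayTable([("vs",), ("v1", "v2"), ("ve",)], num_items=4)
start, end = table.code[("vs",)], table.code[("ve",)]
table.cell(start, end)[2] = True               # record that item number 2 was derived
print(table.cell(start, end))                  # prints [False, False, True, False]
```

Cells are allocated lazily here, yet the outer array still has one slot per pair of cuts, which is the space cost the text goes on to discuss.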

As a second solution, when $|\mathrm{cut}(\pi)|$ is quite large, we can implement $T$ as a trie (Gusfield,
1997). In this case each cut is treated as a string over set $V$, viewed as an alphabet, and we
look up string $c_1 \# c_2$ in $T$ ($\#$ is a symbol not in $V$) in order to retrieve all items involving
cuts $c_1$ and $c_2$ that have been induced so far. We then obtain $d(k) = \mathcal{O}(k)$ and our algorithm
has time complexity $\mathcal{O}(|G|\, k\, (\frac{|V|}{k})^{3k})$.
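
The trie lookup can be sketched with nested dictionaries. The class `CutTrie` is a hypothetical illustration, not the implementation referred to in the text; it assumes cuts are tuples of vertex names and that neither the separator `"#"` nor `None` is used as a vertex name.

```python
class CutTrie:
    """Trie over vertex names, keyed by the string c1 # c2 (illustrative)."""

    SEP = "#"        # separator, assumed not to be a vertex name
    ITEMS = None     # key holding the item set of a node, assumed not a vertex name

    def __init__(self):
        self._root = {}

    def items(self, c1, c2):
        """Return the mutable set of items stored for the pair (c1, c2)."""
        node = self._root
        for symbol in (*c1, self.SEP, *c2):    # one step per symbol of c1 # c2
            node = node.setdefault(symbol, {})
        return node.setdefault(self.ITEMS, set())


trie = CutTrie()
trie.items(("v1", "v2"), ("v3",)).add("[c1, c2]")
print(trie.items(("v1", "v2"), ("v3",)))       # prints {'[c1, c2]'}
print(trie.items(("v1",), ("v2", "v3")))       # prints set()
```

Since a cut contains at most $k$ vertices, a lookup walks at most $2k + 1$ trie nodes, which matches $d(k) = \mathcal{O}(k)$ when each dictionary step is counted as constant time.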

The first solution above is faster than the second one by a factor of $k$. However, the first
solution has the obvious disadvantage of expensive space requirements, since not all pairs
of cuts might correspond to some grammar constituent, and the array $T$ can be very sparse
in practice. It should also be observed that, in the natural language processing applications
discussed in the introduction, $k$ can be quite small, say three or four.

To conclude this section, we compare the time complexity of CFG parsing as traditionally
defined for strings and the time complexity of parsing for IDL-graphs. As reference for string
parsing we take the Earley algorithm, which has already been presented in Section 6. By a
minor change proposed by Graham et al. (1980), the Earley algorithm can be improved to
have time complexity $\mathcal{O}(|G| \cdot n^{3})$, where $G$ is the input CFG and $n$ is the length of the input
string. We observe that, if we ignore the factor $d(k)$ in the time complexity of IDL-graph
parsing (Theorem 1), the two upper bounds become very similar, with function $(\frac{|V|}{k})^{k}$ in
IDL-graph parsing replacing the input sentence length $n$ from the Earley algorithm.

We observe that function $(\frac{|V|}{k})^{k}$ can be taken as a measure of the complexity of the
internal structure of the input IDL-expression. More specifically, assume that no precedence
constraints at all are given for the words of the input IDL-expression. We then obtain
IDL-expressions with occurrences of the I operator only, with a worst case of $k = \frac{|V|}{2} - 1$.
Then $\mathcal{O}((\frac{|V|}{k})^{k})$ can be written as $\mathcal{O}(c^{|V|})$ for some constant $c > 1$, resulting in exponential
running time for our algorithm. This comes as no surprise, since the problem at hand then
becomes the problem of recognition of a bag of words with a CFG, which is known to be
NP-complete (Brew, 1992), as already discussed in Section 2.
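
The exponential growth in this worst case can be made explicit with a short calculation. The derivation below is a sketch that uses only the relation $k = \frac{|V|}{2} - 1$ stated above, that is $|V| = 2k + 2$; the particular constants it arrives at are one possible choice for $c$, not values taken from the original analysis.

```latex
\begin{aligned}
\frac{|V|}{k} &= \frac{2k + 2}{k} = 2 + \frac{2}{k}, \\
\left( \frac{|V|}{k} \right)^{k} &= 2^{k} \left( 1 + \frac{1}{k} \right)^{k} < e \cdot 2^{k}
  = \frac{e}{2} \cdot \big( \sqrt{2} \big)^{|V|}, \\
\left( \frac{|V|}{k} \right)^{3k} &< \left( \frac{e}{2} \right)^{3} \cdot \big( 2\sqrt{2} \big)^{|V|}.
\end{aligned}
```

Since we also have $(\frac{|V|}{k})^{k} \ge 2^{k}$, the function is exponential in $|V|$ and not merely bounded by an exponential.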

Conversely, if no I operator is used in the IDL-expression $\pi$, the resulting
representation matches a finite automaton or word lattice. In this case we have $k = 1$ and
function $(\frac{|V|}{k})^{k}$ becomes $|V|$. The resulting running time is then a cubic function of the
input length, as in the case of the Earley algorithm. The fact that (cyclic or acyclic) finite
automata can be parsed in cubic time is also a well-known result (Bar-Hillel et al., 1964;
van Noord, 1995).

It is worth noting that in applications where $k$ can be assumed to be bounded,
our algorithm still runs in polynomial time. As already discussed, in practical applications
of natural language generation, only a few subexpressions from $\pi$ will be processed
simultaneously, with $k$ being typically, say, three or four. In this case our algorithm behaves in a
way that is much closer to traditional string parsing than to bag parsing.
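
To get a feeling for how the bound of Theorem 1 behaves between these two extremes, one can simply evaluate it. The helper names `cut_bound` and `time_bound` below are hypothetical, and the values are upper-bound expressions with constant factors dropped, not measured running times.

```python
def cut_bound(num_vertices, k):
    """Bound of Lemma 2 on the number of cuts: (|V| / k) ** k."""
    return (num_vertices / k) ** k


def time_bound(grammar_size, num_vertices, k, d=lambda k: 1):
    """Expression inside the O(.) of Theorem 1; d models the table access cost."""
    return grammar_size * d(k) * cut_bound(num_vertices, k) ** 3


# k = 1: word lattice, the expression is cubic in |V|.
print(time_bound(grammar_size=3, num_vertices=20, k=1))
# k = 4 with trie access, d(k) = k: still polynomial in |V| for fixed k.
print(time_bound(grammar_size=3, num_vertices=20, k=4, d=lambda k: k))
# k = |V|/2 - 1: pure bag of words, the expression grows exponentially with |V|.
print(time_bound(grammar_size=3, num_vertices=20, k=9))
```

The default `d` corresponds to the array implementation of $T$, and `d=lambda k: k` to the trie.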

We conclude that the class of IDL-expressions provides a flexible representation for bags
of words with precedence constraints, with solutions in the range between pure word bags
without precedence constraints and word lattices, depending on the value of $\mathrm{width}(\pi)$. We
have also proved a fine-grained result on the time complexity of the CFG parsing problem
for IDL-expressions, again depending on values of the parameter $\mathrm{width}(\pi)$.

8. Final Remarks 

Recent proposals view natural language surface generation as a multi-phase process where 
finite but very large sets of candidate sentences are first generated on the basis of some input 
conceptual structure, and then filtered using statistical knowledge. In such architectures, it 
is crucial that the adopted representation for the set of candidate sentences is very compact, 
and at the same time that the representation can be parsed in polynomial time. 

We have proposed IDL-expressions as a solution to the above problem. IDL-expressions 
combine features that were considered only in isolation before. In contrast to existing 
formalisms, interaction of these features provides enough flexibility to encode strings in 
cases where only partial knowledge is available about word order, whereas the parsing 
process remains polynomial in practical cases. 

The recognition algorithm we have presented for IDL-expressions can be easily extended 
to a parsing algorithm, using standard representations of parse forests that can be extracted 
from the constructed parse table (Lang, 1994). Furthermore, if the productions of the CFG 
at hand are weighted, to express preferences among derivations, it is easy to extract a parse 
with the highest weight, adapting standard Viterbi search techniques as used in traditional 
string parsing (Viterbi, 1967; Teitelbaum, 1973). 

Although we have only considered the parsing problem for CFGs, one may also parse 
IDL-expressions with language models based on finite automata, including n-gram mod- 
els. Since finite automata can be represented as right-linear context-free grammars, the 
algorithm in Figure 3 is still applicable. 
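
The conversion mentioned here is the textbook one: each transition becomes a production that emits a symbol and moves to the next state, and each final state gets an empty production. The sketch below uses a hypothetical encoding (transitions as triples, nonterminals named `Q` plus the state name) and a toy bigram-style automaton whose states are the previously read word.

```python
def automaton_to_right_linear(transitions, finals):
    """Turn a finite automaton into right-linear CFG productions.

    transitions: iterable of (state, symbol, next_state) triples.
    finals: set of final states.
    Returns a list of (left-hand side, right-hand side tuple) pairs.
    """
    productions = []
    for state, symbol, next_state in transitions:
        # Reading `symbol` in `state` corresponds to  Q_state -> symbol Q_next_state.
        productions.append((f"Q{state}", (symbol, f"Q{next_state}")))
    for state in sorted(finals):
        # A final state may stop:  Q_state -> (empty right-hand side).
        productions.append((f"Q{state}", ()))
    return productions


# Toy automaton: each state is named after the last word read.
bigram = [("<s>", "the", "the"), ("the", "dog", "dog"), ("the", "cat", "cat")]
for lhs, rhs in automaton_to_right_linear(bigram, finals={"dog", "cat"}):
    print(lhs, "->", " ".join(rhs) if rhs else "ε")
```

For a weighted model such as an n-gram model, the transition weights would be attached to the corresponding productions; that extension is not shown here.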

Apart from natural language generation, IDL-expressions are useful wherever
uncertainty on word or constituent order is to be represented at the level of syntax and has to be
linearized for the purpose of parsing. As already discussed in the introduction, this is an
active research topic both in generative linguistics and in natural language parsing, and has
given rise to several paradigms, most importantly immediate dominance and linear
precedence parsing (Gazdar, Klein, Pullum, & Sag, 1985), discontinuous parsing (Daniels &
Meurers, 2002; Ramsay, 1999; Suhre, 1999) and grammar linearization (Götz & Penn,
1997; Götz & Meurers, 1995; Manandhar, 1995). Nederhof, Satta, and Shieber (2003)
use IDL-expressions to define a new rewriting formalism, based on context-free grammars
with IDL-expressions in the right-hand sides of productions. By means of this formalism,
fine-grained results were proven on immediate dominance and linear precedence parsing.

IDL-expressions are similar in spirit to formalisms developed in the programming
language literature for the representation of the semantics of concurrent programs. More
specifically, so-called series-parallel partially ordered multisets, or series-parallel pomsets,
have been proposed by Gischer (1988) to represent choice and parallelism among processes.
However, the basic idea of a lock operator is absent from series-parallel pomsets.